1  Introduction

1.1  Speaker  Identification 

Speech  communication  is  the  transfer  of  information  from  one  person’s  mouth  to  another’s  ear  using  sound. 
Sound  is  variations  in  pressure  which  propagate  as  waves  through  the  air  medium  and  reach  the  listener’s 
ear.  The  listener  then  deciphers  these  waves  into  a  received  message.  Speaker  recognition  refers  to  the 
concept  of  recognizing  a  speaker  by  the  sound  of  his/her  voice.  In  automatic  speaker  recognition,  an 
algorithm  takes  the  listener’s  role  in  deciphering  speech  waves  into  either  the  underlying  textual  message 
or  a  hypothesis  concerning  the  speaker’s  identity.  When  the  task  is  to  identify  the  person  talking  rather 
than  what  the  person  is  saying,  the  speech  signal  must  be  processed  to  extract  measures  of  speaker 
variability instead of segmental features. There are two methods for using speaker recognition technology,
namely  speaker  identification  (ID)  and  speaker  verification.  Some  of  the  important  applications  of  speaker 
recognition  include  customer  verification  for  bank  transactions,  access  to  bank  accounts  through  telephones, 
control  on  the  use  of  credit  cards,  and  for  security  purposes  in  the  Army,  Navy  and  the  Air  Force.  Speaker 
recognition  is  described  in  detail  in  [60]  [62]. 

Speaker  identification  deals  with  a  situation  where  the  person  has  to  be  identified  from  among  a 
predetermined  set  of  persons  by  using  his/her  voice  samples.  The  objective  of  speaker  verification  is 
to  verify  the  claimed  identity  of  that  speaker  based  on  the  voice  samples  of  that  speaker  alone.  For 
speaker  recognition,  the  acoustic  aspects  of  what  characterizes  the  differences  between  voices  are  obscure 
and  difficult  to  separate  from  signal  aspects  that  reflect  segment  recognition.  There  are  three  sources  of 
variation  among  speakers.  They  are 

•  Differences  in  vocal  cords  and  vocal  tract  shape; 

•  Differences  in  speaking  style;  and 

•  Differences  in  what  speakers  choose  to  say  [66] . 

Automatic speaker recognizers exploit only the first two variation sources, examining low-level acoustic
features of speech, since a speaker's tendency to use certain words and syntactic structures (the third
source) is difficult to quantify or control in an experiment. Most of the parameters and features used in
the speaker recognition problem contain information useful for the identification of both the speaker and the
spoken message. The speaker ID problem may further be classified into closed set and open set. The closed-set
speaker ID problem refers to a case where the speaker is known a priori to belong to a set of M speakers,
whereas in the open-set problem the speaker may be outside the set. Open-set problems therefore need a
threshold value to decide whether the speaker is outside the set of M speakers.

1.1.1  Feature  Extraction 

Feature  extraction  is  the  process  of  deriving  a  compact  set  of  parameters  that  are  characteristic  of  a  given 
signal.  These  parameters  are  desired  to  preserve  all  the  information  relevant  to  the  application,  and  to 
have  no  redundancy  in  representing  the  signal.  For  speaker  identification,  a  desired  set  of  features  is  one 
that  minimizes  the  intraspeaker  variance  and  at  the  same  time  maximizes  the  interspeaker  variances. 

The  majority  of  speaker  identification  systems  use  some  type  of  short  time  spectral  analysis  followed  by 
a  certain  transformation  as  a  feature  extraction  step.  Due  to  the  short  time  stationarity  of  speech  signals, 
short time spectral analysis is applied to overlapping segments (frames) of length 10 to 30 msec, with an
overlap of a fraction of the segment length. The short time spectrum is transformed into a sequence of feature
vectors  that  compactly  represents  the  underlying  speech  signal. 
Figure 1: A typical histogram of the frame log-energies of conversational speech


The most effective and widely used spectral analysis techniques as front end processors for speech and
speaker recognition applications are LP analysis and filter bank analysis. For the reasons listed below, this
paper focuses on the LP-based front end processor.

Before performing any spectral analysis on the signal, some preprocessing is required. Most important
is speech/silence discrimination. The algorithm for silence removal is described first, followed by
descriptions of the different features that can be used by the system.

Silence  Removal 

Speech/silence discrimination is achieved by signal-dependent energy thresholding. For each utterance,
the energy threshold is determined by constructing the histogram of the frame log-energies. Only
frames with log-energies higher than the determined threshold are kept for further processing. The
threshold is determined based on the fact that, for spontaneous speech, an utterance typically contains a
fair amount of silence or nonspeech. Therefore, the histogram of the frame log-energies shows a bimodal
distribution as shown in figure 1. The mode at the lower end of the log-energy axis corresponds to
silence, and the other mode corresponds to speech. The threshold point between silence and
speech is chosen somewhere between the means of the two modes. The SNR of the signal is computed
based on the determined threshold. Following the speech/silence discrimination step, the speech is processed
by a single-tap high frequency preemphasis filter, and partitioned into 30 ms Hamming windowed
overlapping frames at a rate of 100 frames/sec.
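
The preprocessing chain described above can be sketched as follows. This is only an illustration under stated assumptions: the threshold is found by an iterative two-class mean split (one simple way of landing "somewhere between the means" of the two modes, not necessarily the histogram procedure used by the authors), the preemphasis coefficient `alpha = 0.95` is a hypothetical value, and the function names are invented for this example.

```python
import numpy as np

def frame_signal(x, frame_len, hop):
    """Split x into overlapping frames (one per row); a trailing partial frame is dropped."""
    n_frames = 1 + (len(x) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return x[idx]

def two_mode_threshold(log_e, n_iter=50):
    """Assumed rule: iterate t = midpoint of the means of the frames below and above t."""
    t = log_e.mean()
    for _ in range(n_iter):
        low, high = log_e[log_e <= t], log_e[log_e > t]
        if low.size == 0 or high.size == 0:
            break
        t_new = 0.5 * (low.mean() + high.mean())
        if abs(t_new - t) < 1e-9:
            break
        t = t_new
    return t

def speech_frames(x, fs, frame_ms=30, frames_per_sec=100, alpha=0.95):
    """Return Hamming-windowed, preemphasized frames whose log-energy exceeds the threshold."""
    x = np.asarray(x, dtype=float)
    frame_len = int(round(fs * frame_ms / 1000))
    hop = int(round(fs / frames_per_sec))
    # Frame log-energies of the raw signal drive the speech/silence decision.
    log_e = np.log(np.sum(frame_signal(x, frame_len, hop) ** 2, axis=1) + 1e-12)
    keep = log_e > two_mode_threshold(log_e)
    # Single-tap preemphasis y[n] = x[n] - alpha * x[n-1] (alpha is an assumed value).
    y = np.append(x[0], x[1:] - alpha * x[:-1])
    frames = frame_signal(y, frame_len, hop) * np.hamming(frame_len)
    return frames[keep]
```

At an 8 kHz sampling rate this gives 240-sample frames with an 80-sample hop, so consecutive frames overlap by two thirds of their length.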

Linear  Prediction  Model  for  Feature  Extraction 

Besides being a good estimate to the source-filter speech production model [63], LP analysis [30, 31]
gained its popularity for the following reasons:

•  It is analytically and computationally tractable.

•  It provides spectral estimates that are less biased and have lower variability than Fourier-based
spectral estimates.

•  In speech-pattern recognition applications, it has been found that LP-based front ends provide
comparable or better performance than filter bank front ends [32]. Also, for speaker identification we
have found, in agreement with [33, 34], that LP-based front ends perform at least as well as filter
bank front ends.


The  short-time  transfer  function  of  the  LP  model  is  given  by: 


$$H(z; m) = \frac{1}{A(z; m)} = \frac{1}{1 + \sum_{i=1}^{P} a_i(m) z^{-i}} \tag{1}$$

where m is the frame index representing the time dimension, P is the order of the LP model, and a_i(m) is
the set of prediction coefficients of the frame. A(z; m) is the short-time LP polynomial.

Several  sets  of  features  can  be  derived  from  H(z;m)  [35,  59,  37]. 

Atal  [35]  provided  a  comparison  of  parameters  obtained  from  linear  prediction,  the  impulse  response, 
autocorrelation,  vocal  tract  area  function,  and  cepstral  coefficients  and  found  the  cepstrum  to  provide  the 
best  results  for  speaker  recognition. 

Another  comparison  between  cepstrum  and  log  area  ratios  (LARs)  [59]  for  speaker  verification  concluded 
that  cepstral  coefficients  outperform  the  LARs.  For  high  quality  speech,  line  spectral  pairs  (LSPs)  are  found 
to  yield  speaker  identification  rates  that  are  comparable  to  or  better  than  those  of  the  cepstral  coefficients 
[37].  However,  for  telephone  quality  speech,  cepstral  coefficients  are  found  to  be  superior  to  LSPs.  Today, 
cepstral  coefficients  are  the  dominant  features  used  for  speaker  recognition  [59,  48,  39]. 

In  most  practical  applications,  speech  is  collected  under  different  environments  and  possibly  through 
different  communications  channels.  This  causes  a  mismatch  among  corresponding  reference  and  testing 
patterns.  The  characteristics  of  the  cepstral  coefficients  have  been  extensively  studied  for  the  purpose 
of  minimizing  such  a  mismatch.  In  this  regard,  two  major  postprocessing  steps  have  been  introduced: 
intraframe  processing  known  as  cepstral  weighting  or  liftering  [32,  41,  42],  and  interframe  processing  which 
exploits  the  time  evolution  of  the  cepstral  coefficients  [45,  46,  47,  64].  The  ACW  scheme  introduced  in  this 
paper  falls  within  the  intraframe  processing  techniques. 

Due  to  the  importance  of  the  LP  cepstral  features  in  speech  and  speaker  recognition,  we  dedicate  a 
separate  section  to  discuss  their  properties  and  their  relations  to  other  LP  parameters. 


LP  Cepstrum 

The short-time LP cepstrum is defined as the inverse z transform of the natural logarithm of the short-time
LP transfer function H(z; m). It can be viewed as the impulse response of ln H(z; m), which is given
by:

$$\ln H(z; m) = \sum_{n=1}^{\infty} c_n(m) z^{-n} \tag{2}$$

where c_n(m) is the nth cepstral coefficient of the mth frame.

A simple and unique recursive relationship between c_n(m) and the prediction coefficients a_n(m) can
be obtained by differentiating both sides of (2) with respect to z^{-1} and equating the coefficients of
equal powers of z^{-1}. This relation is given by [35]

$$c_1(m) = -a_1(m),$$

$$c_n(m) = -a_n(m) + \sum_{k=1}^{n-1} \left(\frac{k}{n} - 1\right) a_k(m) c_{n-k}(m), \quad 1 < n \le P,$$

$$c_n(m) = \sum_{k=1}^{P} \left(\frac{k}{n} - 1\right) a_k(m) c_{n-k}(m), \quad n > P. \tag{3}$$
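
As a hedged illustration of the recursion in (3), the sketch below converts the prediction coefficients of one frame into cepstral coefficients. It assumes the sign convention A(z) = 1 + sum_i a_i z^{-i} of Eq. (1); the function name `lpc_to_cepstrum` is invented for this example.

```python
import numpy as np

def lpc_to_cepstrum(a, n_ceps):
    """Cepstrum c_1..c_n_ceps of 1/A(z), where A(z) = 1 + sum_i a[i-1] z^-i."""
    a = np.asarray(a, dtype=float)
    P = len(a)
    c = np.zeros(n_ceps + 1)          # c[0] is unused so that c[n] matches c_n
    for n in range(1, n_ceps + 1):
        acc = -a[n - 1] if n <= P else 0.0
        # The sum runs over k = 1..min(n-1, P) because a_k = 0 for k > P.
        for k in range(1, min(n - 1, P) + 1):
            acc += (k / n - 1.0) * a[k - 1] * c[n - k]
        c[n] = acc
    return c[1:]
```

Requesting more coefficients than the model order simply continues the recursion with the n > P branch.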


An alternative, rather more insightful, method of obtaining the short-time cepstral coefficients is by
relating them to the poles of H(z; m) and hence to the resonance center frequencies and bandwidths.

$$H(z; m) = \frac{1}{\prod_{i=1}^{P} \left(1 - z_i(m) z^{-1}\right)} \tag{4}$$


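By substituting (4) in (2) one gets

$$\sum_{n=1}^{\infty} c_n(m) z^{-n} = -\sum_{i=1}^{P} \ln\left(1 - z_i(m) z^{-1}\right) \tag{5}$$

The factor ln(1 - z_i(m) z^{-1}) can be expanded [65] as

$$\ln\left(1 - z_i(m) z^{-1}\right) = -\sum_{n=1}^{\infty} \frac{1}{n} z_i^n(m) z^{-n} \tag{6}$$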


By combining (5) and (6), c_n(m) can be expressed in terms of the roots of the LP polynomial as follows.

$$c_n(m) = \frac{1}{n} \sum_{i=1}^{P} z_i^n(m) \tag{7}$$

Thus c_n(m) can be interpreted as the power sum of the LP polynomial roots normalized by the cepstral
index [40].

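Since z_i(m) is associated with time varying center frequencies ω_i(m) and bandwidths B_i(m) by the
relation

$$z_i(m) = e^{-B_i(m) + j\omega_i(m)} \tag{8}$$

the short time cepstral coefficients can be expressed as:

$$c_n(m) = \frac{1}{n} \sum_{i=1}^{P} e^{-n B_i(m)} e^{j n \omega_i(m)} = \frac{1}{n} \sum_{i=1}^{P} e^{-n B_i(m)} \cos\left(n \omega_i(m)\right) \tag{9}$$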


Thus  the  cepstral  coefficient  can  be  interpreted  as  a  nonlinear  transformation  of  the  resonance  center 
frequencies  and  bandwidths. 
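
The pole-based view can be checked numerically with a short sketch. It assumes the relation z_i = exp(-B_i + j ω_i), with ω_i in radians per sample and B_i a dimensionless bandwidth parameter, and it assumes that the pole set is closed under conjugation so that the imaginary parts cancel. The function name is hypothetical.

```python
import numpy as np

def cepstrum_from_resonances(omega, B, n_ceps):
    """c_n = (1/n) * sum_i exp(-n * B_i) * cos(n * omega_i), for n = 1..n_ceps.

    omega and B list ALL P poles (both members of each conjugate pair).
    """
    omega = np.asarray(omega, dtype=float)
    B = np.asarray(B, dtype=float)
    n = np.arange(1, n_ceps + 1)[:, None]          # column of cepstral indices
    return np.sum(np.exp(-n * B) * np.cos(n * omega), axis=1) / n[:, 0]
```

Because of the factor exp(-n B_i), broad-bandwidth components die out quickly with increasing n, while narrow-bandwidth components keep contributing to the higher-order coefficients.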


Robust  Feature  Processing 

Cepstral features are found to yield excellent performance for text-independent speaker identification
when training and testing speech are collected under relatively high-quality stationary environments.
However, in practical applications, the speech used by the system is subject to various sources of degradation,
such as background noise and communications channel variability. Such degradations often result
in reduced recognition rates. This is due to the mismatch created among corresponding reference and
testing patterns (in this case cepstral features c_n(m)). To minimize this mismatch, two major cepstral
postprocessing approaches have been introduced: intraframe and interframe processing.

Intraframe  Processing 
Intraframe processing is also known as cepstral weighting or liftering. The rationale behind cepstral
weighting is to account for the sensitivity of the low-order cepstral coefficients to the overall spectral slope
and the sensitivity of the high-order cepstral coefficients to noise. In this regard several fixed weighting
schemes have been recently introduced [43]. By "fixed weighting" we mean that the applied weights are
only a function of the cepstral index n. Therefore these weights are fixed with respect to the frame index
m. Generally, the resulting weighted cepstrum is given by

$$\tilde{c}_n(m) = w_n c_n(m), \tag{10}$$

where w_n is the cepstral weighting window (also known as the lifter).

The simplest and most straightforward weighting sequence is the rectangular weights, which have the
effect of truncating the infinite cepstral sequence. Cepstral truncation has the effect of slightly smoothing
the LP spectra. We say slightly because the LP spectra are smooth by nature and hence need a relatively
small number of parameters to determine them. Alternatively, the truncation is justifiable due to the fact
that the LP cepstrum is the sum of exponentially decaying sequences that can be sufficiently represented
by a finite number of terms L. Since for a Pth order all-pole spectrum the first P cepstral coefficients
uniquely determine that spectrum, L is usually chosen to be equal to or greater than P. The advantages
of cepstral truncation are:

•  to reduce the dimensionality of the cepstrum so as to be usable as a feature vector, and

•  to suppress the variability of the higher order cepstral coefficients.


Other more sophisticated weighting schemes that take advantage of the statistical characteristics of
the cepstral coefficients have been recently introduced. These include bandpass liftering (BPL) [41] and
quefrency liftering [42, 44].

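BPL weights a cepstral sequence by a raised sine function of the form

$$w_n = \begin{cases} 1 + \frac{L}{2} \sin\left(\frac{n\pi}{L}\right), & n = 1, 2, \ldots, L \\ 0, & \text{otherwise} \end{cases} \tag{11}$$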


where  L  is  normally  chosen  to  be  greater  than  P,  the  order  of  the  LP  model.  The  attenuation  of  the 
low-order  coefficients  is  based  on  the  fact  that  these  coefficients  are  more  susceptible  to  channel  variations. 
The  attenuation  of  the  high-order  coefficients  is  based  on  the  same  reason  given  for  truncation. 

Quefrency  liftering  applies  an  asymmetric  triangular  window  of  the  form 


$$w_n = \begin{cases} n, & n = 1, 2, \ldots, L \\ 0, & \text{otherwise} \end{cases} \tag{12}$$


where  n  is  the  cepstral  index. 

This  weighting  is  based  on  the  hypothesis  that  the  standard  deviations  of  the  cepstral  coefficients  are 
inversely  proportional  to  their  cepstral  index  n.  Thus,  this  weighting  scheme  approximates  the  statistical 
normalization  approach  which  is  accomplished  by  multiplying  each  vector  of  an  array  of  vectors  by  the 
inverse  of  the  covariance  matrix  of  that  array.  Here  the  covariance  matrix  is  assumed  to  be  diagonal  which 
is  a  valid  assumption  for  the  cepstral  features.  Other  variations  of  quefrency  liftering  such  as  trapezoidal 
and  symmetric  triangular  were  also  used  [43].  These  lifters  were  used  to  account  for  the  sensitivity  of  high 
order  cepstral  coefficients  to  noise. 
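
The fixed lifters discussed above can be sketched in a few lines. This is an illustration only: the raised-sine window is written in the commonly used form 1 + (L/2) sin(nπ/L), which is an assumption about the exact form intended here, and the function names are invented for this example.

```python
import numpy as np

def rectangular_lifter(L):
    """Unit weights for n = 1..L, i.e. plain truncation of the cepstrum."""
    return np.ones(L)

def bandpass_lifter(L):
    """Raised-sine weights 1 + (L/2) * sin(n*pi/L) for n = 1..L (assumed form)."""
    n = np.arange(1, L + 1)
    return 1.0 + 0.5 * L * np.sin(np.pi * n / L)

def quefrency_lifter(L):
    """Asymmetric triangular (ramp) weights w_n = n for n = 1..L."""
    return np.arange(1, L + 1, dtype=float)

def apply_lifter(C, w):
    """Weight cepstral vectors; C has shape (frames, coefficients), with c_1 in column 0."""
    C = np.asarray(C, dtype=float)
    return C[..., :len(w)] * w      # coefficients beyond L get weight zero, so they are dropped
```

Since the weights depend only on n, the same vector `w` is applied to every frame, which is exactly what makes these schemes "fixed".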

It should be noted here that fixed cepstral weighting can be incorporated in the distance measure
between two unweighted vectors. For example, a weighted Euclidean distance measure between c_n(i) and
c_n(j) is given by

$$d_{i,j} = \sum_{n=1}^{L} w_n^2 \left(c_n(i) - c_n(j)\right)^2 \tag{13}$$

The  above  mentioned  fixed  weighting  schemes  apply  fixed  weights  to  all  the  feature  vectors  extracted 
from  an  utterance  assuming  that  all  the  frames  undergo  the  same  distortion.  This  assumption  is  not  always 
applicable  since  in  many  practical  cases  distortions  vary  with  time.  In  such  cases  an  adaptive  weighting 
scheme  that  is  capable  of  adapting  to  the  time-varying  nature  of  the  distortions  is  desired. 

In  the  following  we  introduce  a  new  adaptive  weighting  scheme  which  results  in  a  new  set  of  cepstral 
features  that  show  robustness  to  channel  variations. 

Interframe  Processing 


Log Area Ratio The next feature investigated is the Log Area Ratio parameters. The Log Area Ratios are
the logarithm of a bilinear transform of the reflection coefficients and are considered a very efficient
transformation of the reflection coefficients for purposes of speech coding. They are related to the
reflection coefficients k(i) by the formula:

$$g(i) = \frac{1}{2} \ln \frac{1 + k(i)}{1 - k(i)} \quad \text{for } i = 1, 2, \ldots \tag{14}$$

These log area ratios correspond to logarithms of the ratios between the cross-sectional areas of adjacent
sections of the vocal tract filter estimated by linear prediction.
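
A minimal sketch of this mapping is given below, assuming the form g = (1/2) ln((1 + k)/(1 - k)), which is the inverse hyperbolic tangent of k. Sign conventions for reflection coefficients differ between texts, so the sign of the result may need flipping to match a particular LP implementation; the function name is hypothetical.

```python
import numpy as np

def log_area_ratios(k):
    """Map reflection coefficients with |k| < 1 to log area ratios."""
    k = np.asarray(k, dtype=float)
    if np.any(np.abs(k) >= 1.0):
        raise ValueError("reflection coefficients of a stable model satisfy |k| < 1")
    return 0.5 * np.log((1.0 + k) / (1.0 - k))
```

The mapping stretches values of k close to ±1, which is where the spectrum is most sensitive to small changes in the reflection coefficients.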

ACW  Cepstrum 

The ACW scheme modifies the LP spectrum so as to emphasize the formant structure. This is achieved
by operating on the different components of the spectrum, namely by amplifying the narrow-bandwidth
components and attenuating the broad-bandwidth components. The resulting modified spectrum introduces
zeros to the LP all-pole model. This is equivalent to a FIR filter that normalizes the contribution of
the dominant modes of the signal (the formants).

For  a  given  speech  frame,  the  all-pole  model  can  be  expressed  in  a  parallel  form  by  partial  fraction 
expansion: 


$$H(z) = \frac{1}{1 + \sum_{i=1}^{P} a_i z^{-i}} = \sum_{i=1}^{P} \frac{r_i}{1 - z_i z^{-1}} \tag{15}$$


Since each pole z_i represents the center frequency ω_i and the bandwidth B_i of the component, each
component can be fully parameterized by ω_i, B_i, and the expansion residue r_i.

The sensitivity of the parameters (ω_i, B_i, r_i) with respect to channel variations has been experimentally
evaluated by virtue of the following experiment:


•  A voiced frame of speech is processed through a random single-tap channel given by:

$$\theta_j(z) = 1 - a_j z^{-1} \tag{16}$$

where a_j is a sequence of uniformly distributed random numbers between 0.0 and 1.0.

•  The sequences of the parameters (ω_i, B_i, r_i) of all components are computed for each j in the random
sequence a_j.

•  Two sequences of (ω_i, B_i, r_i) are selected to represent a narrow-bandwidth component and a
broad-bandwidth component.

•  The sensitivity of the parameters of the selected narrow-bandwidth and broad-bandwidth components
is evaluated by histogram analysis.

Figure 2: Block diagram of the experiment of the sensitivity of the LP spectral component parameters
with respect to a random single-tap channel.

The  block  diagram  of  the  experiment  is  shown  in  figure  2. 

By examining the resulting histograms of the parameters of the broad-bandwidth component shown
in figure (3), one concludes that the three parameters (ω_i, B_i, r_i) associated with such components show
large variances with respect to channel variations. Therefore, under channel variations, such components
introduce undesired variability to the LP spectrum that results in a mismatch among testing and training
patterns.

Narrow-bandwidth  components  tend  to  preserve  their  center  frequencies  and  bandwidths  since  their 
histograms  show  small  variances.  However,  the  values  of  their  residues  demonstrate  large  variances.  This 
effect  is  shown  in  figure  (4). 

These observations suggest guidelines to modify the LP spectrum so as to be robust to such variations.
The modifications should be aimed at:

•  eliminating the residues r_i from the LP spectrum, and

•  attenuating the contribution of the broad-bandwidth components.

One way of achieving the suggested modifications is to normalize the residues {r_i}, for example by setting
r_i = constant, which can be viewed as weighting the ith component by 1/r_i. Normalizing {r_i} results in a
modified spectrum which we refer to as the ACW spectrum. The ACW spectrum is given by


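$$\hat{H}(z) = \frac{N(z)}{A(z)} = \sum_{i=1}^{P} \frac{1}{1 - z_i z^{-1}} \tag{17}$$

where

$$N(z) = \sum_{k=1}^{P} \prod_{i=1, i \neq k}^{P} \left(1 - z_i z^{-1}\right) \tag{18}$$

which can be defactorized into the form

$$N(z) = P\left(1 + \sum_{i=1}^{P-1} b_i z^{-i}\right) \tag{19}$$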
Figure 3: Histograms of the parameters of a broad-bandwidth component.

Figure 4: Histograms of the parameters of a narrow-bandwidth component.

By this modification to the LP spectrum the peak-value of each component is given by

$$\left|\frac{1}{1 - z_i z^{-1}}\right|_{z = e^{j\omega_i}} = \frac{1}{1 - |z_i|} \approx \frac{1}{B_i} \tag{20}$$

Equation (20) shows that the ACW spectrum emphasizes the formant structure by weighting each
component approximately by 1/B_i. Thus narrow-bandwidth components are amplified and broad-bandwidth
components are attenuated.
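
The parallel form and the effect of normalizing the residues can be illustrated numerically. The sketch below assumes distinct, nonzero poles; the helper names (`residues`, `lp_spectrum`, `acw_spectrum`, `component_peak`) are invented for this example and are not taken from the original work.

```python
import numpy as np

def residues(poles):
    """Residues r_i with 1 / prod_k(1 - z_k x) = sum_i r_i / (1 - z_i x), where x = z^-1."""
    poles = np.asarray(poles, dtype=complex)
    r = np.empty_like(poles)
    for i, zi in enumerate(poles):
        others = np.delete(poles, i)
        r[i] = 1.0 / np.prod(1.0 - others / zi)
    return r

def lp_spectrum(poles, omega):
    """All-pole transfer function evaluated on the unit circle."""
    x = np.exp(-1j * np.asarray(omega))
    return 1.0 / np.prod(1.0 - np.asarray(poles)[:, None] * x, axis=0)

def acw_spectrum(poles, omega):
    """Parallel form with every residue forced to 1."""
    x = np.exp(-1j * np.asarray(omega))
    return np.sum(1.0 / (1.0 - np.asarray(poles)[:, None] * x), axis=0)

def component_peak(pole):
    """Magnitude of a unit-residue component at its own center frequency."""
    return abs(1.0 / (1.0 - pole * np.exp(-1j * np.angle(pole))))
```

For a pole of radius 0.95 the unit-residue peak is 1/(1 - 0.95) = 20, while a pole of radius 0.6 peaks at only 2.5, which is the amplification of narrow-bandwidth components described above.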

The ACW transfer function is no longer an all-pole autoregressive (AR) transfer function, as it now has a
MA filter represented by P - 1 zeros. This MA filter introduced by normalizing the residues can be viewed as a
FIR filter. This filter creates a spectrum whose components' peak values are inversely proportional to their
bandwidths. This concept is illustrated in figure (5), where the components of the LP spectrum and of the
ACW spectrum for a voiced speech frame are shown.

Figure (6) demonstrates the spectral mismatch created by a single-tap channel by showing the components
of the LP spectrum of the same frame used in figure (5) after processing through (1 - 0.9 z^{-1}).

In  figure  (7)  the  robustness  of  the  ACW  spectrum  is  demonstrated  by  showing  that  the  same  channel 
that  disturbed  the  components  of  the  LP  spectrum  has  a  very  small  effect  on  the  components  of  the  ACW 
spectrum. 

The channel effect on the composite LP and ACW spectra is shown in figure (8). It is obvious that
the mismatch between the LP spectra before and after processing through the channel is much larger than
that between the corresponding ACW spectra.
Figure 5: Components of (a) the LP spectrum, and (b) the ACW spectrum of a voiced speech frame


2  Fast  ACW  cepstrum 


Concept  of  ACW 

A pth order LP analysis of speech leads to an LP polynomial A(z) and an all-pole model H(z) = 1/A(z)
of the speech spectrum. The polynomial A(z) is expressed as

$$A(z) = 1 - \sum_{k=1}^{p} a_k z^{-k} = \prod_{k=1}^{p} \left(1 - f_k z^{-1}\right) \tag{21}$$

which in turn can be guaranteed to be minimum phase by the autocorrelation method of LP analysis.
The conventional LP cepstrum c_lp(n) is defined for n > 0 and can be found by a recursion involving the
coefficients a_k [32].

The approach in [67] is to first perform a partial fraction expansion of H(z) to get

$$H(z) = \sum_{k=1}^{p} \frac{r_k}{1 - f_k z^{-1}} \tag{22}$$

The experiments in [67] reveal that the residues r_k show considerable variations, especially for nonformant
poles, when the speech is degraded by a channel. Therefore, the variations in r_k were removed by forcing
r_k = 1 for every k. Hence, we get a pole-zero system function of the form

$$\frac{N(z)}{A(z)} = \sum_{k=1}^{p} \frac{1}{1 - f_k z^{-1}} \tag{23}$$

where

$$N(z) = \sum_{k=1}^{p} \prod_{i=1, i \neq k}^{p} \left(1 - f_i z^{-1}\right) \tag{24}$$

which can be further written as

$$N(z) = p\left(1 - \sum_{k=1}^{p-1} b_k z^{-k}\right). \tag{25}$$

Therefore, the ACW cepstrum is given by

$$c_{acw}(n) = c_{lp}(n) - c_{nn}(n) \tag{26}$$

for n > 0, where c_nn(n) can be found by a recursion [32] involving the coefficients b_k.

It must be noted that the present method of finding c_acw(n) from A(z) involves the following steps [67].

1.  Find c_lp(n) from a_k

2.  Determine the roots of A(z)

3.  Find all the cofactors of A(z) of order p-1 and add them up to get N(z)

4.  Find c_nn(n) from b_k

5.  Find c_acw(n) = c_lp(n) - c_nn(n)

Steps 2 and 3 are mainly responsible for the increase in computational burden over merely finding c_lp(n).
As we shall see later, this increase is by a factor of 1.4. With the fast algorithm we propose, the increase
in computation is a very small factor of 1.02.
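
The five steps listed above can be sketched as follows. This is an illustration under assumptions rather than the implementation of [67]: the polynomial is passed as its full coefficient vector in powers of z^{-1} with leading coefficient 1 (which sidesteps the sign convention of the a_k), root finding is delegated to `numpy.roots`, and the helper names are invented for this example.

```python
import numpy as np

def allpole_cepstrum(q, n_ceps):
    """Cepstrum c_1..c_n_ceps of 1/Q(z), Q(z) = q[0] + q[1] z^-1 + ..., with q[0] == 1."""
    P = len(q) - 1
    c = np.zeros(n_ceps + 1)
    for n in range(1, n_ceps + 1):
        acc = -q[n] if n <= P else 0.0
        for k in range(1, min(n - 1, P) + 1):
            acc -= (1.0 - k / n) * q[k] * c[n - k]
        c[n] = acc
    return c[1:]

def acw_cepstrum_via_roots(A, n_ceps):
    """ACW cepstrum obtained by explicit root finding (the slow route)."""
    A = np.asarray(A, dtype=float)
    p = len(A) - 1
    c_lp = allpole_cepstrum(A, n_ceps)                 # step 1
    roots = np.roots(A)                                # step 2
    N = np.zeros(p, dtype=complex)
    for k in range(p):                                 # step 3: add up the order p-1 cofactors
        N += np.poly(np.delete(roots, k))
    N = np.real(N) / p                                 # scale so the leading coefficient is 1
    c_nn = allpole_cepstrum(N, n_ceps)                 # step 4
    return c_lp - c_nn                                 # step 5
```

Dividing N(z) by p only changes the zeroth cepstral coefficient, which is not used, so the result for n > 0 is unaffected.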
2.1  Mathematical  Definition  of  Numerator  Polynomial

Theorem: Every single coefficient b_k of N(z) in Eq. (25) is of the form

$$b_k = \frac{p - k}{p} a_k$$

for all k, 1 ≤ k ≤ p - 1, where a_k is the kth coefficient of the LP polynomial A(z) (see Eq. (21)).
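Proof: Let

$$P_p(z) = \prod_{k=1}^{p} (z - z_k) \tag{27}$$

be a polynomial of order p with roots z_k. The derivative of P_p(z) can be written as

$$\frac{d}{dz} P_p(z) = \frac{d}{dz}\left[(z - z_i) \frac{P_p(z)}{z - z_i}\right] \tag{28}$$

$$= (z - z_i) \frac{d}{dz}\left[\frac{P_p(z)}{z - z_i}\right] + \frac{P_p(z)}{z - z_i} \tag{29}$$

for all i, 1 ≤ i ≤ p. The quotient being differentiated on the right hand side is in turn a polynomial of
order p - 1, and therefore its derivative can be further written as

$$\frac{d}{dz}\left[\frac{P_p(z)}{z - z_i}\right] = (z - z_j) \frac{d}{dz}\left[\frac{P_p(z)}{(z - z_i)(z - z_j)}\right] + \frac{P_p(z)}{(z - z_i)(z - z_j)} \tag{30}$$

for all j, 1 ≤ j ≤ p, j ≠ i. Therefore,

$$\frac{d}{dz} P_p(z) = (z - z_i)(z - z_j) \frac{d}{dz}\left[\frac{P_p(z)}{(z - z_i)(z - z_j)}\right] + \frac{P_p(z)}{z - z_i} + \frac{P_p(z)}{z - z_j} \tag{31}$$

Thus, by induction, the final expression for the derivative of P_p(z) is obtained as

$$\frac{d}{dz} P_p(z) = \sum_{i=1}^{p} \frac{P_p(z)}{z - z_i} = \sum_{i=1}^{p} \prod_{k=1, k \neq i}^{p} (z - z_k) \tag{32}$$

By inspecting Eq. (32), it is obvious that the derivative of a polynomial of order p is equal to the sum of
all the polynomial cofactors of order p - 1. Since N(z) is also a sum of all the LP polynomial cofactors of
order p - 1, z^{p-1} N(z) is the derivative of z^p A(z) = z^p - \sum_{k=1}^{p} a_k z^{p-k}. Differentiating term by
term gives p z^{p-1} - \sum_{k=1}^{p-1} (p - k) a_k z^{p-k-1}, and comparing with Eq. (25) shows that the
coefficients are given as in the theorem.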
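
The theorem reduces the construction of N(z) to a coefficient-wise scaling, which the following sketch illustrates. As an assumption made for convenience, the polynomial is again passed as its full coefficient vector in powers of z^{-1} with leading coefficient 1, so the scaling (p - k)/p applies regardless of the sign convention used for a_k. The function names are hypothetical.

```python
import numpy as np

def acw_numerator_fast(A):
    """Coefficients of N(z)/p obtained by scaling: entry k of A is multiplied by (p - k)/p."""
    A = np.asarray(A, dtype=float)
    p = len(A) - 1
    k = np.arange(p)                   # k = 0..p-1; the z^-p term vanishes
    return (p - k) / p * A[:p]

def allpole_cepstrum(q, n_ceps):
    """Cepstrum c_1..c_n_ceps of 1/Q(z), Q(z) = q[0] + q[1] z^-1 + ..., with q[0] == 1."""
    P = len(q) - 1
    c = np.zeros(n_ceps + 1)
    for n in range(1, n_ceps + 1):
        acc = -q[n] if n <= P else 0.0
        for j in range(1, min(n - 1, P) + 1):
            acc -= (1.0 - j / n) * q[j] * c[n - j]
        c[n] = acc
    return c[1:]

def acw_cepstrum_fast(A, n_ceps):
    """ACW cepstrum from two recursions, with no root finding."""
    return allpole_cepstrum(A, n_ceps) - allpole_cepstrum(acw_numerator_fast(A), n_ceps)
```

Compared with the conventional LP cepstrum, the only extra work is one vector scaling and a second recursion of order p - 1.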


2.2  Minimum  Phase  Property  of  Numerator  Polynomial 

In order to define the ACW cepstrum as in Eq. (26), it is necessary and sufficient that N(z) be minimum
phase. Here, we show that N(z) is minimum phase.

Theorem: [68] Let A be a p × p complex matrix. Then, each eigenvalue λ of the matrix A lies in one of the
disks D_i in the complex plane

$$D_i = \left\{\lambda : |\lambda - a_{ii}| \le \sum_{j=1, j \neq i}^{p} |a_{ij}|\right\} \tag{33}$$

for all i, 1 ≤ i ≤ p, where a_ij are the elements of the matrix A.

The sets D_i are disks in the complex plane centered at a_ii and of radius \sum_{j \neq i} |a_ij|. They are called
Gerschgorin disks of the matrix A.
Suppose that the real matrix A has the companion form [69]:

$$A = \begin{bmatrix} 0 & 1 & 0 & \cdots & 0 & 0 \\ 0 & 0 & 1 & \cdots & 0 & 0 \\ \vdots & & & \ddots & & \vdots \\ 0 & 0 & 0 & \cdots & 1 & 0 \\ 0 & 0 & 0 & \cdots & 0 & 1 \\ -a_p & -a_{p-1} & -a_{p-2} & \cdots & -a_2 & -a_1 \end{bmatrix} \tag{34}$$

The characteristic polynomial of the matrix A is given as:

$$P_p(z) = z^p + \sum_{i=1}^{p} a_i z^{p-i} \tag{35}$$

Due to the companion form of the matrix A, there are only two different Gerschgorin disks

$$D_1 = \{\lambda : |\lambda| \le 1\} \quad \text{and} \quad D_2 = \left\{\lambda : |\lambda + a_1| \le \sum_{j=2}^{p} |a_j|\right\} \tag{36}$$

containing the zeros of P_p(z). Furthermore, if P_p(z) is minimum phase, then D_2 ⊂ D_1 (see Fig. 1).

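The derivative of P_p(z) is given as:

$$\frac{d}{dz} P_p(z) = p z^{p-1} + \sum_{i=1}^{p-1} (p - i) a_i z^{p-i-1} \tag{37}$$

Its roots are the same as the eigenvalues of the (p - 1) × (p - 1) matrix

$$A' = \begin{bmatrix} 0 & 1 & 0 & \cdots & 0 & 0 \\ 0 & 0 & 1 & \cdots & 0 & 0 \\ \vdots & & & \ddots & & \vdots \\ 0 & 0 & 0 & \cdots & 1 & 0 \\ 0 & 0 & 0 & \cdots & 0 & 1 \\ -a_{p-1}/p & -2a_{p-2}/p & -3a_{p-3}/p & \cdots & -(p-2)a_2/p & -(p-1)a_1/p \end{bmatrix} \tag{38}$$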

Each eigenvalue of the matrix A' lies in one of the following two disks:

$$D_1 = \{\lambda : |\lambda| \le 1\} \quad \text{and} \quad D_3 = \left\{\lambda : |\lambda + (p-1)a_1/p| \le \sum_{j=2}^{p-1} (p - j)|a_j|/p\right\} \tag{39}$$

Since

$$\sum_{j=2}^{p-1} (p - j)|a_j|/p < \sum_{j=2}^{p} |a_j| \quad \text{and} \quad (p-1)|a_1|/p < |a_1|, \tag{40}$$

disk D_3 is a shrunken version of the disk D_2 with its center translated along the real axis toward the
origin. Since P_p(z) is minimum phase, D_2 ⊂ D_1. Since D_3 is a shrunken and translated (toward the origin)
version of D_2, it directly follows that D_3 ⊂ D_1 (see Fig. 1). Therefore, the minimum phase property of
the derivative of P_p(z) is established.

We  have  shown  that  if  a  polynomial  is  minimum  phase,  its  derivative  is  also  minimum  phase.  Hence, 
the  minimum  phase  property  of  A{z)  ensures  the  minimum  phase  nature  of  N{z). 
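
The statement can be checked numerically on an example. The sketch below builds the companion matrix of the normalized derivative from the coefficients a_1..a_p of P_p(z) = z^p + sum_i a_i z^{p-i} and inspects its eigenvalues. It is an illustration, not a proof, and the function name is invented; that the roots of a derivative lie in the convex hull of the roots of the polynomial is also known as the Gauss-Lucas theorem, which is consistent with the conclusion above.

```python
import numpy as np

def derivative_companion(a):
    """Companion matrix (p-1) x (p-1) of P_p'(z)/p for P_p(z) = z^p + sum_i a[i-1] z^(p-i)."""
    a = np.asarray(a, dtype=float)
    p = len(a)
    M = np.zeros((p - 1, p - 1))
    M[:-1, 1:] = np.eye(p - 2)            # ones on the superdiagonal
    j = np.arange(1, p)                   # column index j = 1..p-1
    M[-1, :] = -j * a[p - j - 1] / p      # last row entries: -j * a_{p-j} / p
    return M
```

The eigenvalues of the returned matrix are the zeros of N(z), so their magnitudes can be compared against 1 directly.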


2.3  Computer  Time 

Speech sampled at 8 kHz served as input to a system that performs LP analysis and converts the LP coefficients
to either the conventional cepstrum or the ACW cepstrum. Optimized code implementing
this system was run on a SPARC 10. Three different scenarios were compared in terms of CPU time.
In scenario 1, the LP coefficients were transformed into c_lp(n) via the well known recursion. In scenario
2, the LP coefficients were transformed into c_acw(n) by the method offered in this paper for finding N(z)
and employing two separate recursions on N(z) and A(z). In scenario 3, the LP coefficients were again
transformed into c_acw(n) but, unlike scenario 2, N(z) was found by a standard polynomial root finding
program [70]. The ratio of the required computer time for going from speech to cepstral features through
scenarios 1, 2 and 3 is 1:1.02:1.40. This shows that our proposed method is much faster than doing
polynomial root finding. Also, the more robust ACW cepstrum can be obtained with negligible overhead
as compared to the conventional LP cepstrum.

Frame  Selection 

As shown above, computing the component information (center frequencies and bandwidths) is an
intermediate step in obtaining the ACW features. This information can be used as a basis for selecting
frames to be included in the sequence of feature vectors representing a given speech signal. This frame
selection criterion is based on the following reasons.

•  Voiced speech carries most of the speaker-dependent information.

•  Speech frames whose spectra have an apparent formant structure are the most discriminative and
noise-robust frames.


Depending on the bandwidth of a given speech signal, one can devise a criterion for frame selection based
on the formant information. This criterion can be summarized as follows. Frames that have a certain number
of resonances that lie within a specified frequency range, and have bandwidths smaller than a specified
threshold, are selected. This concept is depicted in figure (10).
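
A possible form of this criterion is sketched below. All numeric settings (frequency range, bandwidth threshold, required count) are hypothetical parameters left to the caller, and the bandwidth in Hz is estimated from the pole radius with the common approximation -ln|z| · fs/π, which is an assumption rather than something specified here.

```python
import numpy as np

def select_frame(poles, fs, f_lo, f_hi, max_bw_hz, min_count):
    """True if at least min_count resonances lie in [f_lo, f_hi] Hz with bandwidth <= max_bw_hz."""
    poles = np.asarray(poles, dtype=complex)
    poles = poles[poles.imag > 0]                        # one pole per conjugate pair
    freq = np.angle(poles) * fs / (2.0 * np.pi)          # center frequency in Hz
    bw = -np.log(np.abs(poles)) * fs / np.pi             # approximate 3 dB bandwidth in Hz
    ok = (freq >= f_lo) & (freq <= f_hi) & (bw <= max_bw_hz)
    return int(np.sum(ok)) >= min_count
```

For example, at fs = 8000 Hz a pole of radius 0.98 has an estimated bandwidth of about 51 Hz, whereas a pole of radius 0.7 has one of about 908 Hz and would be rejected by a 200 Hz threshold.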


Interframe  Processing 

Unlike  intraframe  processing,  interframe  processing  exploits  the  temporal  variability  of  a  sequence  of  feature 
vectors.  The  rationale  behind  interframe  processing  can  be  summarized  by  the  following  reasons: 

•  To  emphasize  the  transitional  information  which  is  believed  to  provide  orthogonal  information  to  the 
instantaneous  features  obtained  from  the  intraframe  processing  [45]. 

•  To  compensate  for  stationary  and  slowly  varying  linear  channel  effects  that  result  in  severe  mismatch 
between  training  and  testing  data.  This  is  achieved  by  removing  time-invariant  spectral  information. 


Transitional  information  is  often  referred  to  as  dynamic  features.  In  this  paper  we  limit  our  discussion 
to  interframe  processing  methods  associated  with  spectral  features  in  the  cepstral  domain. 

It  has  been  shown  in  [35]  that  the  effect  of  any  fixed  frequency  response  distortion  introduced  by  the 
recording  apparatus  or  the  transmission  channel  can  be  eliminated  from  a  cepstral  sequence  simply  by 
subtracting  its  long-term  mean.  It  is  interesting  to  notice  that  subtracting  the  long-term  mean  in  the 
cepstral  domain  is  equivalent  to  dividing  by  the  geometric  mean  in  the  spectral  domain: 


$$\hat{c}_n(m) = c_n(m) - \frac{1}{M} \sum_{k=1}^{M} c_n(k) \qquad (41)$$
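
As a minimal sketch of long-term mean subtraction, assume the cepstral sequence is stored as a NumPy array with one row per frame and one column per cepstral coefficient (an assumed layout; the function name is hypothetical). A fixed channel adds the same offset to every frame in the cepstral domain, so removing the per-coefficient mean removes that offset.

```python
import numpy as np

def cepstral_mean_subtraction(c):
    """Subtract the long-term mean of each cepstral coefficient.

    c: array of shape (num_frames, num_coefficients).
    """
    c = np.asarray(c, dtype=float)
    # The mean is taken over frames (axis 0), separately for each coefficient.
    return c - c.mean(axis=0, keepdims=True)
```

Applying the function to a sequence and to the same sequence plus any constant vector gives the same output, which is the channel-compensation property described above.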
Figure 10: Frame selection based on formant information. Frames that have a certain number of poles
that lie within the shaded region are selected.


Spectral dynamic features are often represented by the time differential information of the cepstral
sequence. The most straightforward representation is the first difference:

$$\Delta c_n(m) = c_n(m) - c_n(m-1). \qquad (42)$$

However, the first difference is susceptible to noise since it amplifies the high frequency components in the
time trajectories of the cepstral coefficients. Therefore, the time derivative of $c_n(m)$ is approximated by
polynomial approximation [59]. This approximation has the effect of bandpass filtering the time trajectories
of $c_n(m)$ instead of the highpass filtering effect of the first difference. The filtered coefficients are known
as the delta-cepstral coefficients and are given by:

$$\Delta c_n(m) = \sum_{k=-K}^{K} c_n(m-k)\,\delta(k), \qquad (43)$$

where $\delta(k)$ represents the impulse response of the $(2K+1)$-tap bandpass filter which approximates the
derivative of $c_n(m)$. The filter taps are given by:

$$\delta(k) = \frac{-k}{\sum_{j=-K}^{K} j^2}, \qquad k = -K, -K+1, \ldots, K \qquad (44)$$

A typical value for $K$ is 2 or 3. This technique is also known as the $(2K+1)$-point regression line.

Another interframe processing technique is known as RASTA (RelAtive SpecTrA) [46]. Similar to the
delta-cepstrum, RASTA has the effect of bandpass filtering the time trajectories of the cepstral coefficients.
However, the RASTA filter includes a first order autoregression which has the effect of recursively removing
the temporal average of the cepstral sequence. It also results in smoother cepstral trajectories due to the
low-pass nature of the first order autoregression. The RASTA LP cepstrum $\Delta_R c_n(m)$ is given by

$$\Delta_R c_n(m) = \sum_{k=-K}^{K} c_n(m-k)\,\delta(k) + \alpha\,\Delta_R c_n(m-1) \qquad (45)$$

where $\alpha$ is the coefficient of the first order autoregressive filter. This coefficient has a typical value of 0.98
[46]. To show the effect of interframe processing on the short-time cepstral trajectories in the frequency
domain, the frequency responses of the first difference, delta-cepstrum, and the RASTA filter are shown
in Figure 11. Notice that the frame sampling frequency is 100 frames/sec.
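
The delta-cepstrum and RASTA recursions of equations (43) to (45) can be sketched as follows. This is an illustration, not the original implementation: the filter taps are read as the usual regression-line taps, $-k$ divided by the sum of squared tap indices, the sequence edges are handled by repeating the first and last frames, and the function names are hypothetical. The cepstral sequence is assumed to be an array of shape (frames, coefficients).

```python
import numpy as np

def delta_taps(K):
    """Taps of the (2K+1)-point regression filter, indexed k = -K..K (K >= 1)."""
    k = np.arange(-K, K + 1)
    return -k / np.sum(k ** 2)

def delta_cepstrum(c, K=2):
    """Delta-cepstrum: sum over k of c(m - k) * tap(k), per coefficient."""
    c = np.asarray(c, dtype=float)
    n = len(c)
    # Edge handling is an assumption: repeat the first and last frames.
    padded = np.pad(c, ((K, K), (0, 0)), mode="edge")
    out = np.zeros_like(c)
    for tap, k in zip(delta_taps(K), range(-K, K + 1)):
        start = K - k  # index of c(m - k) for m = 0 in the padded array
        out += tap * padded[start:start + n]
    return out

def rasta_cepstrum(c, K=2, alpha=0.98):
    """Delta-cepstrum followed by a first order autoregression."""
    d = delta_cepstrum(c, K)
    out = np.zeros_like(d)
    prev = np.zeros(d.shape[1])
    for m in range(len(d)):
        prev = d[m] + alpha * prev
        out[m] = prev
    return out
```

Because the taps sum to zero, both filters have zero gain at zero frequency, so a constant offset added to every frame (a fixed channel) does not change their output.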

2.3.1  Speaker  Modeling 


Neural  Tree  Network 

The  supervised  classifier  considered  here  is  the  modified  neural  tree  network  (MNTN)  [49].  The  NTN 
[56]  is  a  hierarchical  classifier  that  combines  the  properties  of  decision  trees  [57]  and  feed-forward  neural 
networks  [58].  Whereas  the  NTN  is  strictly  a  classification  tree,  i.e.,  only  the  leaf  labels  are  used,  the 
MNTN  additionally  uses  probability  measures  at  the  leaf  nodes. 

The  assignment  of  probability  measures  occurs  within  a  technique  called  forward  pruning.  The  forward 
pruning  algorithm  consists  of  simply  truncating  the  growth  of  the  tree  beyond  a  certain  level.  For  the 
leaves  at  the  truncated  level,  a  vote  is  taken  and  the  leaf  is  assigned  the  label  of  the  majority.  In  addition 
to  a  label,  the  leaf  is  also  assigned  a  confidence.  The  confidence  is  computed  as  the  ratio  of  the  number 
of  elements  for  the  vote  winner  to  the  total  number  of  elements.  The  confidence  provides  a  measure  of 
confusion for the different regions of feature space. The concept of forward pruning is illustrated in Figure 12.

Figure 11: Frequency responses of various interframe filters

Figure 12: Forward Pruning and Confidence Measures

An MNTN can be trained for each speaker in the population as follows. First, the MNTN for each
speaker  is  presented  with  a  training  set  that  is  comprised  of  the  data  for  all  speakers.  Here,  the  extracted 
feature  vectors  for  that  speaker  are  labeled  as  “one”  and  the  extracted  feature  vectors  for  everyone  else  are 
labeled  as  “zero”.  A  binary  MNTN  for  speaker  i  is  then  trained  with  this  data.  This  procedure  is  repeated 
for  all  speakers  in  the  population. 

Specifically,  a  trained  MNTN  can  be  applied  to  speaker  recognition  as  follows.  Given  a  sequence  of 
feature vectors $x$ from the test utterance and a trained MNTN for speaker $S_i$, the corresponding speaker
score is found as:

$$P_{MNTN}(x \mid S_i) = \frac{\sum_{j=1}^{M} c_j^1}{\sum_{j=1}^{M} c_j^1 + \sum_{j=1}^{N} c_j^0} \qquad (46)$$

where $c^1$ and $c^0$ are the confidence scores for the speaker and antispeaker, respectively. Here, $M$ and
$N$ correspond to the number of vectors classified as “one” and “zero”, respectively.
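
A small sketch of the scoring step follows. It assumes that the speaker score is the sum of the confidences of the vectors labeled "one" divided by the sum of all confidences, which is one reading of the score definition above; the function name and input format are hypothetical.

```python
def mntn_score(labels, confidences):
    """Speaker score from per-vector MNTN leaf labels and confidences.

    labels:      iterable of 0/1 leaf labels, one per test vector.
    confidences: iterable of leaf confidences, one per test vector.
    """
    pairs = list(zip(labels, confidences))
    speaker = sum(conf for label, conf in pairs if label == 1)
    antispeaker = sum(conf for label, conf in pairs if label == 0)
    total = speaker + antispeaker
    # With no test vectors the score is undefined; 0.0 is an arbitrary choice.
    return speaker / total if total > 0 else 0.0
```

For speaker identification, this score would be computed with each speaker's MNTN and the speaker with the highest score selected.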

Vector  Quantization 

The unsupervised classifier considered here is vector quantization (VQ). The VQ algorithm is based
on clustering. This falls under the category of unsupervised training, i.e., the class label is not used.
Clustering will automatically group the training data into its individual modes or classes. Numerous VQ
algorithms exist, including the Linde-Buzo-Gray (LBG) [54] method and the K-means algorithm [55]. The
LBG method is used here.

The  VQ  classifier  can  be  used  for  speaker  recognition  [48]  as  follows.  Given  the  extracted  feature  vectors 
from  a  speaker,  a  codebook  is  constructed  for  that  speaker.  This  process  is  repeated  for  all  speakers  in  the 
population.  For  speaker  identification,  the  feature  vectors  from  a  test  utterance  are  applied  to  each  of  the 
codebooks.  For  a  given  codebook,  the  centroid  that  is  closest  to  the  test  vector  is  found  and  the  distance 
to  this  centroid  is  accumulated  for  that  codebook.  This  process  is  repeated  for  all  test  vectors  and  the 
speaker  is  selected  as  corresponding  to  the  codebook  with  the  minimum  accumulated  distance. 
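
The identification step can be sketched as below, assuming the codebooks have already been trained (for example with LBG, which is not shown) and that the Euclidean distance is used. The distance measure and the function names are assumptions made for illustration.

```python
import numpy as np

def accumulated_distortion(test_vectors, codebook):
    """Sum, over all test vectors, of the distance to the closest centroid."""
    x = np.asarray(test_vectors, dtype=float)   # shape (T, D)
    cb = np.asarray(codebook, dtype=float)      # shape (C, D)
    # Pairwise Euclidean distances between test vectors and centroids: (T, C).
    dist = np.linalg.norm(x[:, None, :] - cb[None, :, :], axis=2)
    return float(dist.min(axis=1).sum())

def identify_speaker(test_vectors, codebooks):
    """Return the speaker whose codebook gives the minimum accumulated distance.

    codebooks: dict mapping a speaker name to that speaker's codebook.
    """
    scores = {
        speaker: accumulated_distortion(test_vectors, cb)
        for speaker, cb in codebooks.items()
    }
    return min(scores, key=scores.get)
```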
Figure 13: Data Fusion System


3  Data  Fusion 

It  is  often  advantageous  to  combine  the  opinions  of  several  experts  when  making  a  decision.  For  example, 
when  one  is  obtaining  a  medical  diagnosis,  a  decision  for  subsequent  care  may  become  easier  after  obtaining 
several  opinions  as  opposed  to  just  one.  This  concept  has  been  exploited  in  the  field  of  data  fusion  for 
tasks  including  handwriting  recognition  [50],  remote  sensing  [52],  etc. 

The  general  form  of  a  data  fusion  system  is  illustrated  in  Figure  13.  Given  a  set  of  feature  vectors, 
each  expert  outputs  its  own  observation,  which  can  consist  of  a  probability  measure,  class  label,  etc.  The 
combiner  will  then  use  one  of  many  methods  to  collapse  these  observations  into  a  single  decision.  The  set  of 
feature  vectors  can  also  be  different  from  classifier  to  classifier,  which  would  be  a  case  of  sensor  fusion  [51]. 
However, the work in this chapter only considers the case of different experts and not different features.

There  are  numerous  ways  to  combine  the  opinions  of  multiple  experts.  For  example,  if  the  outputs  of 
all  experts  are  probabilities  then  a  simple  combination  method  would  be  to  take  a  weighted  sum  of  the 
probabilities  or  of  the  logs  of  the  probabilities.  These  methods  are  known  as  the  linear  opinion  pool  and 
log  opinion  pools  [52].  If  the  outputs  of  the  experts  are  class  labels,  then  methods  such  as  voting  [50]  or 
ranking [53] can be used. For fuzzy decisions, Dempster-Shafer theory can also be used for the combination
of  experts  [50].  This  chapter  evaluates  the  linear  and  log  opinion  pool  methods  for  speaker  recognition. 
These  methods  are  described  in  more  detail  as  follows. 

3.1  Linear  Opinion  Pool 

The  linear  opinion  pool  is  a  commonly  used  data  fusion  technique  that  is  convenient  due  to  its  simplicity. 
The  linear  opinion  pool  is  evaluated  as  a  weighted  sum  of  the  classifier  outputs: 

$$P_{linear}(x) = \sum_{i=1}^{n} \alpha_i P_i(x) \qquad (47)$$

where $P_{linear}(x)$ is the probability of the combined system, $\alpha_i$ are weights, $P_i(x)$ is the probability of the
$i$th individual classifier, and $n$ is the number of classifiers. For all experiments in this paper, each $\alpha_i$ is
between zero and one and the sum of the $\alpha_i$ is equal to one.
Figure 14: Data Fusion Approach

The  linear  opinion  pool  is  appealing  in  that  the  output  is  a  probability  distribution  and  the  weights 
$\alpha_i$ provide a rough measure of the expertise of the expert. However, it is noted that the probability
distribution  of  the  combiner  may  be  multimodal,  which  may  impose  a  more  complicated  decision  strategy. 
The  linear  opinion  pool  has  been  considered  in  speaker  recognition  for  the  combination  of  features  [45], 
namely  cepstrum  and  delta  cepstrum  features. 

3.2  Log  Opinion  Pool 

An alternative to the linear opinion pool is the log opinion pool. If the $\alpha$ weights are constrained to lie
between  zero  and  one  and  sum  up  to  one,  then  the  log  opinion  pool  also  outputs  a  probability  distribution. 
However,  as  opposed  to  the  linear  opinion  pool,  the  output  distribution  of  the  log  opinion  pool  is  unimodal 
[52]. 

The  log  opinion  pool  consists  of  a  weighted  product  of  the  classifier  outputs: 

$$P_{log}(x) = \prod_{i=1}^{n} P_i(x)^{\alpha_i} \qquad (48)$$

Note  that  with  this  formulation,  if  any  expert  assigns  a  probability  of  zero,  then  the  combined  probability 
will also be zero. Hence, an individual expert has the capability of a “veto”, whereas in the linear opinion
pool  the  zero  probability  would  be  averaged  in  with  the  other  probabilities. 

One problem that both the linear and log opinion pools are subject to is the selection of the weights
$\alpha_i$. Several heuristic solutions [52] to this are to 1) use equal weights, i.e., $\alpha_i = 1/n$, 2) use weights
proportional to a ranking, i.e., $\alpha_i = r_i / \sum_{j=1}^{n} r_j$, or 3) evaluate the weights over the range of zero to one for
cross-validation data and select the best $\alpha_i$.
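
The two pools can be compared with a short sketch. It assumes each expert outputs a single probability for the hypothesis under test and that the weights lie between zero and one and sum to one; the function names are hypothetical.

```python
import numpy as np

def _check_weights(weights):
    w = np.asarray(weights, dtype=float)
    if np.any(w < 0.0) or np.any(w > 1.0) or not np.isclose(w.sum(), 1.0):
        raise ValueError("weights must lie in [0, 1] and sum to one")
    return w

def linear_opinion_pool(probs, weights):
    """Weighted sum of the expert probabilities."""
    w = _check_weights(weights)
    p = np.asarray(probs, dtype=float)
    return float(np.sum(w * p))

def log_opinion_pool(probs, weights):
    """Weighted product of the expert probabilities."""
    w = _check_weights(weights)
    p = np.asarray(probs, dtype=float)
    return float(np.prod(p ** w))

def equal_weights(n):
    """Heuristic 1: every expert receives the weight 1/n."""
    return np.full(n, 1.0 / n)
```

With two experts reporting 0.8 and 0.0 under equal weights, the linear pool averages the zero in with the other probability, whereas the log pool returns zero, which is the veto behavior described above.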

3.3  Text-Independent  Speaker  Identification  Using  Data  Fusion 

Data  fusion  principles  are  used  to  combine  the  outputs  of  the  NTN  and  VQ  classifiers.  The  method  used 
here  is  the  linear  opinion  pool.  This  consists  of  evaluating  a  weighted  sum  of  the  classifier  outputs  as 
illustrated  in  Figure  14. 
Here, the outputs of the NTN and VQ classifiers are multiplied by $\alpha$ and $1 - \alpha$, respectively, where $\alpha$
lies between zero and one. Hence, when $\alpha = 0$ the system consists solely of the VQ classifier and, likewise,
when $\alpha = 1$ only the NTN is used.

To  enable  the  VQ  and  NTN  classifiers  to  be  used  in  such  a  system,  several  normalization  steps  must 
be  used.  First,  the  VQ  distortion  and  NTN  confidence  must  be  converted  to  the  same  scale.  Here,  the  VQ 
distortion  and  NTN  confidence  are  normalized  to  lie  on  a  scale  of  0  to  1,  where  1  denotes  a  perfect  match. 
These  scaled  scores  are  analogous,  though  not  equivalent,  to  probabilities. 

The  VQ  distortion  is  normalized  as  follows: 

$$P_{VQ}(x \mid S_j) = \qquad (49)$$


where $c_i$ is the centroid closest to $x$. The NTN label and confidence, which lies between 0.5 and 1.0, can
be normalized to a single score as follows:

$$P_{NTN}(x \mid S_j) = \begin{cases} 0.5\,(1.0 + \text{confidence}), & \text{if label} = 1 \\ 0.5\,(1.0 - \text{confidence}), & \text{if label} = 0 \end{cases} \qquad (50)$$


These  scores  can  now  be  combined  and  evaluated  for  speaker  recognition  tasks. 
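
The NTN normalization and the final combination can be sketched as follows. The VQ score is taken as an input that has already been normalized to the range 0 to 1, since its normalization formula is not reproduced here; the function names are hypothetical.

```python
def ntn_score(label, confidence):
    """Map an NTN leaf label and its confidence to a single score."""
    if label == 1:
        return 0.5 * (1.0 + confidence)
    return 0.5 * (1.0 - confidence)

def fused_score(ntn, vq, alpha):
    """Linear opinion pool of the normalized NTN and VQ scores.

    alpha = 1 uses only the NTN score; alpha = 0 uses only the VQ score.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie between zero and one")
    return alpha * ntn + (1.0 - alpha) * vq
```

In an identification task, the fused score would be computed for every enrolled speaker and the speaker with the highest score selected.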


3.3.1 Confidence Measures

The  overall  confidence  measure  is  calculated  as  a  weighted  linear  combination  of  three  individual  measures. 
The  first  measure  is  based  on  the  mismatch  in  the  SNR  of  the  training  and  testing  data.  The  second  measure 
is  based  on  the  channel  mismatch  between  the  training  and  testing  data.  The  third  measure  is  based  on 
the  amount  of  training  and  testing  time. 


SNR  Mismatch 

Any SNR mismatch between the training and testing data degrades the confidence
with which a decision can be made. A confidence level is computed based on estimates of the
SNR of the training and testing data. The SNR mismatch is the absolute value of the difference between
the two SNR values. The confidence level is computed from this mismatch.

Speaker  identification  experiments  for  various  SNR  mismatches  were  conducted  to  get  an  identification 
success rate. From a discrete set of points relating the success rate to the SNR mismatch, a continuous
functional  fit  is  obtained.  This  function  is  specified  as  one  of  two  fourth  order  polynomials  depending  on  the 
SNR  mismatch.  During  actual  operation,  the  function  value  is  calculated  after  finding  the  SNR  mismatch. 
This  value  is  the  expected  success  rate  or  equivalently  the  confidence  level.  The  memory  requirement 
consists  of  only  the  polynomial  coefficients. 
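
The fit-then-evaluate procedure can be sketched with NumPy's polynomial routines. The split point between the two fourth order polynomials, the clipping of the result to the range 0 to 1, and the function names are assumptions made for this illustration; the measured success rates must come from actual experiments and are passed in as arguments.

```python
import numpy as np

def fit_piecewise_quartic(mismatch_db, success_rate, split_db):
    """Fit one fourth order polynomial on each side of `split_db`.

    Each side needs at least five measured points for a determined fit.
    Returns (coeffs_low, coeffs_high) in np.polyval order.
    """
    m = np.asarray(mismatch_db, dtype=float)
    s = np.asarray(success_rate, dtype=float)
    low = m <= split_db
    return np.polyfit(m[low], s[low], 4), np.polyfit(m[~low], s[~low], 4)

def snr_confidence(snr_train_db, snr_test_db, coeffs_low, coeffs_high, split_db):
    """Evaluate the stored polynomial at the observed SNR mismatch."""
    mismatch = abs(snr_train_db - snr_test_db)
    coeffs = coeffs_low if mismatch <= split_db else coeffs_high
    # Clipping keeps the value a valid confidence outside the fitted range.
    return float(np.clip(np.polyval(coeffs, mismatch), 0.0, 1.0))
```

Only the two coefficient arrays need to be stored at run time, which matches the memory requirement stated above.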


Channel  Mismatch 

A channel mismatch between the training and testing data degrades the confidence
with which a decision can be made. A confidence level is computed based on a quantitative determination
of the channel mismatch between the training and testing data. Let $c_{tr}$ be the cepstral mean vector for
the training data. Similarly, let $c_{tt}$ be the cepstral mean vector for the test data. The channel mismatch is


20 log ||c_tr ...   (51)


The  confidence  level  is  computed  from  this  mismatch. 

Speaker identification experiments for various channel mismatches were conducted to get an identification
success rate. From a discrete set of points relating the success rate to the channel mismatch,
a polynomial fit is obtained. During actual operation, the polynomial value is calculated after finding
the channel mismatch. This value is the expected success rate or, equivalently, the confidence level. The
memory requirement consists of only the polynomial coefficients.

Figure 15: SNR confidence (plot title: Confidence Level for SNR Mismatch)

Training/Testing  Time 

For a given training time of 3 seconds, the identification success rate is found for testing times up to
3 seconds. From this, a functional fit relating the success rate or the confidence level to the testing time
for  a  given  training  time  is  obtained.  A  family  of  functions  based  on  this  fit  is  derived  for  various  training 
times.  Therefore,  knowledge  of  the  training  and  testing  times  will  yield  the  confidence  level  by  evaluation 
of  one  of  the  members  of  the  family  of  functions.  When  the  testing  time  exceeds  3  seconds,  it  is  assumed 
to  be  3  seconds  for  the  purposes  of  finding  the  confidence.  With  regard  to  memory,  it  suffices  to  store  the 
function  originally  obtained  for  a  training  time  of  3  seconds. 

4  Word  Spotting 

Word spotting is the process of locating a predefined keyword utterance within a continuous speech
utterance. Word spotting is a subset of the general speech recognition task consisting of a limited vocabulary of
keywords  and  a  method  of  detecting  words  not  in  the  vocabulary.  Training  and  modeling  of  word  spotting 
systems  can  fall  under  two  categories,  speaker  dependent  and  speaker  independent.  Speaker  dependent 
systems  recognize  spoken  words  from  a  specific  speaker.  Speaker  independent  systems  are  trained  from 
speakers  from  a  sample  population  and  are  expected  to  generalize  to  other  populations. 

Figure 16: Training/testing time confidence

Early work in the area of word recognition concentrated on methods based on elastic
template matching schemes [2][3][4][13][24][6] using extracted speech features with a dynamic programming
algorithm [1]. Some of these systems avoided the problem of detecting the word boundaries by either using a separate
boundary  detection  scheme  or  requiring  the  utterance  to  contain  only  a  single  word.  A  template  matching 
scheme  transforms  a  segment  of  a  speech  utterance  into  a  multidimensional  vector.  These  temporally 
aligned  vectors  are  used  to  construct  a  reference  template.  The  reference  template  is  then  used  to  classify 
unknown  speech  to  the  nearest  template  based  on  a  distance  metric.  Related  systems  using  Hidden  Markov 
Models  for  isolated  word  recognition  grew  out  of  the  research  in  template  matching  systems  [14][15][27]. 

Word  spotting  from  continuous  unconstrained  speech  allows  a  more  flexible  user  interface,  transferring 
the burden of word boundary detection from the user to the machine for “hands-free” machine control
applications.  Other  applications  of  continuous  word  spotting  systems  include  monitoring  keywords  within 
transmitted  radio  communications,  voice  activated  calling  and  telephone  routing  systems.  Early  studies 
on  automatic  operator  assisted  telephone  applications  showed  that  approximately  20%  of  the  callers,  when 
asked  to  give  an  isolated  word  response,  added  extraneous  speech  within  the  response  [24]  [26]  [27].  Clearly 
a  robust  automated  system  must  include  methods  to  disregard  extraneous  speech. 

Many of these speech recognition systems avoid practical issues, such as real-time or fast response
of the system to a spoken word. The approach taken by a few systems focused on a low level parallelism of
the underlying recognition system [6][23], but many of the practical issues are neglected for constructing a
real-time  recognition  algorithm.  Some  work  on  using  only  a  partial  Viterbi  backtrace  during  the  recognition 
process  in  an  HMM  system  allows  faster  performance  without  excessive  loss  of  accuracy  [18]  [22],  but  the 
algorithm  still  requires  excess  speech  lag  time  for  the  backtrace,  hindering  a  real-time  response. 

4.1  Neural  Network  Based  Systems 

Approaches using neural networks have also been suggested for word and phoneme recognition applications.
Incorporation of neural networks into speech recognition systems allows the use of discriminative
training to enhance the recognition performance. These systems range from those entirely based on neural
network technology to hybrid approaches. Some of these systems use neural networks directly for subword
recognition [21][19][29], or for the construction of whole word models [10]. Hybrid approaches combine multiple
neural networks or other classifier systems in a hierarchical post processing approach [11][12][28] or a
data fusion approach [8][20], combining the outputs of multiple classifiers.
4.2 Replacing Gaussian Mixture Models in an HMM with CDNTN Models

Context dependent models improve discrimination between subword models. By combining the NTN
with the Gaussian mixture model presently used in the HMM to model the output probabilities, the
discrimination ability of the NTN is expected to improve the performance. The new CDNTN model provides one
example of combining the two models. By using the CDNTN to model the posterior probability of a subword
segment within a keyword, the subword models can be connected together to form a Markov chain. For
word spotting applications, the discriminative training data of the CDNTN model can be constructed to allow
better discrimination between subword models within a keyword. One method is to apply discriminative
training between subword state models. Feature vectors of one subword that are difficult to separate from
vectors assigned to other subwords will be grouped together by the NTN in a region with low confidence
or low posterior probability. Feature vectors that are well separated from those assigned to other subwords
will be grouped by the NTN into high confidence regions. This can allow a natural partitioning of the
subword data so that complex distributions can be more accurately modeled.

Once  a  Markov  chain  is  created  for  each  keyword,  state  durations  can  be  extracted  from  the  training 
data  and  clustered  to  form  a  state  duration  template  to  non-parametrically  model  the  durations  of  each 
state.  For  testing,  a  dynamic  time  warping  algorithm  can  be  used  to  evaluate  the  state  outputs  for  the 
test  utterance  against  the  state  duration  model  extracted  from  the  training  data.  The  state  duration 
template provides a temporal model for the state outputs during a keyword occurrence, distinguishing
between random state outputs and temporally aligned outputs during a keyword occurrence, and obviating the
need for a recognition network.

4.2.1  Feature  Extraction 


Mel-warped  Cepstrum 

Spectral  analysis  using  a  bank  of  filters  is  a  well  known  front-end  processor  for  speech  recognition 
systems.  Generally  the  filter  bank  is  non-uniform  and  the  spacing  of  the  filters  is  based  on  critical  bands 
proposed by perceptual studies. The critical band scale is almost linear for frequencies below 1000 Hz and
almost logarithmic for frequencies above 1000 Hz. The mel scale is an approximation to the critical band scale.
The relationship between frequency $f$ (in kHz) and the mel scale is approximated by the following equation:

$$\mathrm{mel} = 1000 \log_2(1 + f) \qquad (52)$$

Thus,  the  mel-scale  filter  bank  has  filters  spaced  uniformly  on  a  mel-frequency  scale.  Generally,  the 
individual  filters  have  a  triangular  bandpass  frequency  response,  and  the  spacing  between  the  filters  as  well 
as  their  bandwidths  are  determined  by  a  constant  mel  frequency  interval.  An  overlap  equal  to  half  the 
bandwidth is typically present between adjacent filters. A mel-scale filter bank is illustrated in Figure 17.
Suppose a mel-scale filter bank of $Q$ filters is designed, and the magnitude output of each filter is
$m_i$ ($i = 1, \ldots, Q$). These magnitude outputs $m_i$ are then converted into mel-frequency cepstral coefficients
(MFCC) $c_j$ ($j = 1, \ldots, M$) by applying the discrete cosine transform as follows:

$$c_j = \sum_{i=1}^{Q} m_i \cos\left(\frac{\pi j}{Q}(i - 0.5)\right) \qquad (j = 1, \ldots, M) \qquad (53)$$

where $M$ is the cepstral order.
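
Equations (52) and (53) can be sketched as below. The cosine transform is written in its usual form, a sum over the $Q$ filter outputs with the argument $\pi j (i - 0.5) / Q$, which is a reading of equation (53). The sketch also assumes the filter outputs passed in are already in the form the transform expects; in common practice these are log magnitudes, which the text does not state explicitly. Function names are hypothetical.

```python
import numpy as np

def hz_to_mel(f_hz):
    """Mel value for a frequency in Hz, using mel = 1000 * log2(1 + f_kHz)."""
    f_khz = np.asarray(f_hz, dtype=float) / 1000.0
    return 1000.0 * np.log2(1.0 + f_khz)

def mfcc_from_filterbank(m, M):
    """Cosine transform of Q filter bank outputs into M cepstral coefficients."""
    m = np.asarray(m, dtype=float)
    Q = len(m)
    i = np.arange(1, Q + 1)            # filter index 1..Q
    j = np.arange(1, M + 1)            # cepstral index 1..M
    basis = np.cos(np.pi * j[:, None] * (i[None, :] - 0.5) / Q)  # shape (M, Q)
    return basis @ m
```

By construction the mapping sends 1000 Hz to 1000 mel, and a flat set of filter outputs yields cepstral coefficients of zero for every index from 1 to M.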

4.3  Word  Modeling 

CDNTN 
Figure 17: A mel-scale filter bank, with constant mel frequency bandwidth and overlap equal to half the
bandwidth

4.3.1  CDNTN  Subword  State  Model  Description 

The  CDNTN  modeling  mechanism  is  formed  by  blending  the  NTN  with  a  mixture  model  used  in  an  HMM. 
This  can  be  described  in  HMM  terms  as  constructing  multiple  parallel  states  for  each  subword  model.  This 
is  depicted  in  Figure  18  by  considering  that  each  feature  vector  presented  to  the  subword  model  is  first 
passed  through  an  NTN  network,  which  determines  which  parametric  model  will  be  used  to  generate  the 
output  probability  for  the  subword.  This  method  allows  parallel  parametric  models  to  be  naturally  formed 
for  multiple  vocalizations  of  the  same  subword  from  different  speakers.  Each  vocalization  is  then  weighted 
by the confidence of that sound as a representative example from the training database. The separation of
the  parallel  parametric  models  is  accomplished  by  the  NTN  by  minimization  of  the  cost  function  used  to 
grow  the  tree.  For  the  application  in  the  word  spotting  system,  this  cost  function  chooses  the  most  likely 
subword  sound  with  respect  to  all  the  possible  subword  sounds.  Those  nearest  in  feature  space  are  grouped 
together  by  the  NTN  hyperplanes.  Once  the  NTN  decides  which  substate  model  to  use,  the  parametric 
model,  weighted  by  the  confidence  of  that  region  of  feature  space,  is  chosen  by  the  multiplexer  as  the 
subword  probability. 

4.3.2  Modeling  HMM  State  Durations  using  Dynamic  Time  Warping 

As  an  alternative  to  modeling  state  durations  and  state  transitions  within  an  HMM  model  by  transition 
probabilities,  some  HMM  systems  have  been  reported  which  replace  the  transition  probabilities  with  explicit 
state  duration  models  [5]  [16]  [17].  These  state  duration  models  are  usually  built  into  the  HMM  network 
as  a  parameterized  function  of  time,  and  relax  the  constraints  of  the  Markov  property  of  the  probability 
of  state  occupation.  As  an  alternative  to  parametric  modeling  of  the  state  durations,  a  non  parametric 
method  was  developed  by  explicitly  modeling  the  state  occupations,  and  creating  a  template  model  which 
uses  dynamic  time  warping  (DTW)  to  warp  the  speech  to  account  for  time  rate  variations. 


Figure 18: CDNTN Subword State Model (diagram labels: Feature Vector, Substate Models)

4.3.3  Dynamic  Time  Warping 

Dynamic time warping is the process of applying a more general dynamic programming algorithm to optimally
time align a sequence of vectors to a reference sequence. Dynamic programming was first described
by Bellman [1] in 1957 as a method for the solution of multi-stage decision processes. These decision processes
were described by a sequence of physical systems where each stage was characterized by a small set
of state variables or parameters. At each stage there is a choice of a number of decisions. An assumption
is made that the solution of the present state is independent of the past solutions. The dynamic programming
algorithm maximizes some function on the state variables by making locally optimum decisions. As a
discrete example, each state of the system can be described by a vector whose components vary with each
discrete time step. Dynamic time warping applies the principles of dynamic programming to nonlinearly
warp the time axis of a sequence of vectors to optimally match a reference set. This warping process has the
effect of stretching or compressing the time sequence of vectors to provide the best match to a reference
template.

Given  a  multi-dimensional  feature  vector  that  varies  with  time,  a  template  is  created  by  ordering  each 
reference  template  in  an  array  for  a  fixed  duration.  The  template  duration  or  length  is  typically  the 
average  duration  of  a  keyword.  Typically  the  template  is  comprised  of  a  set  of  multi-dimensional  cepstral 
coefficients  derived  within  a  fixed  time  frame.  This  template  is  used  with  a  test  frame  of  unknown  speech 
features  to  create  a  search  grid  for  the  DTW  search  depicted  in  Figure  19.  In  Figure  19  the  j  axis  is  defined 
as  time  samples  for  the  reference  template  and  the  i  axis  as  the  time  samples  for  the  test  pattern.  The 
reference  pattern  of  length  J  is  placed  against  the  j  axis  and  the  test  pattern  of  length  I  is  placed  along  the 
i  axis.  The  DTW  algorithm  finds  the  locally  optimal  match  in  time  between  the  reference  template  and 
test pattern. At each grid point along a row defined by the reference template, the distance between the
two templates is calculated. The grid point corresponding to the nearest distance between the template
samples is selected as the locally optimum match, and the algorithm is repeated until the final match
point $(I, J)$ is reached. At each match point, the distance is accumulated, and when the final match point is
reached the accumulated distance score is available. The optimal search path is represented in Figure 19
by the bold lines through the grid. Many variations and constraints have been developed [5] to constrain
the search within reasonable limits to avoid unnatural jumps in time.

Figure 19: Test and reference patterns for DTW template matching
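
A compact sketch of the grid search follows. It uses the textbook dynamic programming recursion, in which each grid point extends the cheapest of its three predecessors, with a Euclidean local distance and no slope or window constraints. These choices and the function name are assumptions for illustration and are not necessarily the variant used in the system described here.

```python
import numpy as np

def dtw_distance(test, reference):
    """Accumulated distance of the best time alignment between two sequences.

    test:      sequence of I feature vectors (or scalars).
    reference: sequence of J feature vectors (or scalars).
    """
    x = np.asarray(test, dtype=float)
    y = np.asarray(reference, dtype=float)
    if x.ndim == 1:
        x = x[:, None]
    if y.ndim == 1:
        y = y[:, None]
    I, J = len(x), len(y)
    # Local distance at every grid point (i, j).
    local = np.linalg.norm(x[:, None, :] - y[None, :, :], axis=2)
    acc = np.full((I, J), np.inf)
    acc[0, 0] = local[0, 0]
    for i in range(I):
        for j in range(J):
            if i == 0 and j == 0:
                continue
            # Allowed predecessors: one step in i, one step in j, or both.
            best = min(
                acc[i - 1, j] if i > 0 else np.inf,
                acc[i, j - 1] if j > 0 else np.inf,
                acc[i - 1, j - 1] if i > 0 and j > 0 else np.inf,
            )
            acc[i, j] = local[i, j] + best
    return float(acc[I - 1, J - 1])
```

A test pattern that is a time-stretched copy of the reference, such as 1, 1, 2, 3 against 1, 2, 3, aligns with zero accumulated distance, which illustrates the stretching and compressing effect described in the text.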

4.3.4  Non  Parametric  State  Duration  Models  Using  Dynamic  Time  Warping 

The use of dynamic time warping to find the optimal sequence of state outputs in a Markov chain of states is similar to a Viterbi search, with one exception: the likelihood measure is generated not only from the scores of the highest output states, but the algorithm also accumulates distance scores for states that should predict a low probability of state occupation. Figure 20 shows how durational models can be combined with the CDNTN subword models. In Figure 20, each subword model consists of a single CDNTN model trained to output the posterior probability of that subword occurring. The actual subword model is not restricted, and a discrete or continuous model can also be used. The structure of the model in Figure 20 is identical to that used for a hidden Markov model, except that the transition probabilities in the hidden Markov model are replaced by explicit duration probability models. The duration of the subword is determined by comparing the state outputs to the duration model for that state using a dynamic programming algorithm. The state duration templates can be obtained directly from the subword durations in the training data set. A set of pre-derived state models can be used with the training utterance to generate a set of duration templates.

Figure 20: CDNTN Word Model


4.3.5  Nonparametric  State  Duration  Modeling  for  Word  Spotting 

Conventional DTW word spotting systems typically use cepstral feature vectors for constructing the reference templates and test patterns [5]. An alternative to these methods is to use the HMM state emission probabilities as template components in order to model the state durations. It is possible to construct a vector consisting of the state phone models within a keyword for each time unit as the reference template values. To create a reference template, a phoneme based HMM word model can be constructed in the manner outlined in the previous HMM based word spotting system. Once the state models have been parameterized, the feature vectors created from the keyword utterances used in the training data set can be passed through each phoneme model to generate a family of matrices consisting of state emission probabilities as a function of time. These templates can then be time aligned and averaged for all the keyword utterances in a similar manner to [25] to create a single reference template. This reference template contains the average state emission probabilities for each phoneme within the keyword as a function of time.

The DTW state duration model approach minimizes the difference between the reference template and a test template made from every state model at each time instant. This distance is penalized not only when the most likely state has a low output probability; it also accounts for any state that has a different output than the template pattern. This method uses the additional information of certain states having low output emission probabilities at the same time as others having high outputs. In a phoneme based system, this corresponds to penalizing a word model that predicts multiple phonemes having high output probabilities at the same time, when the models should predict a single phonetic candidate. For a system using discriminatively trained state models, the use of all the state outputs maximizes the use of the training information where some states are parameterized to predict low probabilities for some feature vectors. This is in contrast to the Viterbi algorithm used in the HMM classifier, where this information is lost by jumping to the state with the peak likelihood in the search path. Also, the conventional HMM classifier models the state durations by multiplying each duration output by connecting transition probabilities. The Viterbi algorithm uses the same basic method as described for the DTW system in 1.4, except that the test template is replaced by a simple time index, and can be considered as a subset of the more general dynamic programming algorithm. The best path is determined by the maximum log likelihood accumulated through the grid where each transition is multiplied by the appropriate transition probability.
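The benefit of scoring all state outputs, rather than only the most active state, can be seen in a small numerical sketch. The two distance functions below are hypothetical simplifications written for illustration only: the first compares the full vector of state outputs at one time instant, and the second looks only at the state the reference template expects to be most active. Neither is taken from the system described here.

```python
import math

def all_state_distance(test_frame, ref_frame):
    """Euclidean distance over the full vector of state outputs."""
    return math.dist(test_frame, ref_frame)

def best_state_only_distance(test_frame, ref_frame):
    """Distance restricted to the state with the highest reference output."""
    k = max(range(len(ref_frame)), key=lambda s: ref_frame[s])
    return abs(test_frame[k] - ref_frame[k])

# Illustrative values: three subword states at one time instant.
ref = [0.9, 0.1, 0.1]        # template expects only the first state to be active
confused = [0.9, 0.8, 0.1]   # a second state is also (wrongly) active

# The best-state-only measure sees no error; the all-state measure is penalized.
print(best_state_only_distance(confused, ref))
print(all_state_distance(confused, ref))
```

Under these assumed values the frame with two simultaneously active states is indistinguishable from a perfect match when only the peak state is examined, while the full-vector distance registers the extra active state.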

4.4  The  CDNTN  Word  Spotting  System 

A new word spotting system was developed using a CDNTN model for modeling phoneme output emission probabilities within an HMM framework. Subword durational probabilities were modeled using the nonparametric DTW template method described earlier.

4.4.1 Training the CDNTN Word Spotting System

As a preliminary test of the proposed word spotting system, initial state segmentation of the phoneme boundaries of the keywords for the Road Rally Speech Corpus was done using an HMM forced alignment. This was accomplished by training a serial network of three state HMM triphone models over all the utterances for each keyword. Once the HMM training was complete, the utterances were passed through the HMM networks and each feature vector was labeled according to the most likely phoneme model using a Viterbi search. This method was used to force a phonetically aligned segmentation for each keyword within the Waterloo section of the Road Rally Speech Corpus. The forced alignment method can be replaced by one based solely on the CDNTN, which is described in [20]. A CDNTN segmental alignment procedure was implemented, and experimental results showed that a highly accurate segmentation is possible when tested on a data base of phonetically segmented speech [20].

The Road Rally Speech Corpus was used to train and test the system. This data base consists of two independently recorded sections of different speakers from different dialects. The training section, known as the Waterloo Corpus, is made up of 56 speakers, 28 male and 28 female, reciting a paragraph about planning a road rally, recorded through an actual telephone system, sampled at 10 kHz and filtered through a 300 Hz to 3300 Hz PCM FIR bandpass filter. The Waterloo section consists of approximately two hours of read speech in total. Marking files for 20 keywords are provided, which specify the locations of the 20 keywords within the speech files. A separate test corpus, called the Stonehenge section, is made up of free unrestricted conversational speech independently recorded between two speakers planning a road rally. The Stonehenge speech was recorded on high quality microphones and filtered using a 300 Hz to 3300 Hz PCM FIR bandpass filter to simulate telephone bandwidth quality.

4.4.2  Growing  CDNTN  State  Models 

Once the speech data was phonetically segmented according to a phonetic dictionary, the phoneme segmentations were used to define subword states for discriminatively training a CDNTN to predict the posterior probability of a subword given a training vector. The anti-class data used for each subword CDNTN model were the remaining subword vectors within the keyword, labeled as not belonging to the subword being modeled. This amounts to training each subword model to predict the posterior output probability given a feature vector with respect to the other subwords within the keyword. CDNTN trees were grown for each phoneme occurring in each of the 20 keywords in the Road Rally Speech Corpus. A total of 122 CDNTN trees were grown to model the subwords for the 20 keywords.

By restricting the training data used to develop the subword models to only phonetic data from that keyword, the CDNTN based keyword model can be used to define the most probable subword segmentations given a sequence of putative keyword feature vectors. Alternate strategies for constructing the training data set have been tried, such as adding confused words and randomly selecting alternate keyword data as anti-class data. These methods were found to provide no substantial improvement in the keyword spotting system. A fundamental obstacle faced in using a discriminative classifier to construct subword models is that of defining an appropriate training set. A trade-off exists between trying to construct a global phonetic model using vast amounts of training data and one using locally appropriate data. Data sets using large amounts of anti-class data can mask the probability distribution of the specific phoneme model by artificially introducing a prior bias.

Once a training data set is established for each subword model, the CDNTN is grown using an L1 cost function. Pruning is necessary to allow a sufficient representation of each subword class within each leaf region. The percentage based forward pruning method described in the appendix, combined with terminating growth at a maximum level, was found to provide the best results for the word spotting application. The number of mixtures was increased incrementally, using a hierarchical k-means algorithm.

Figure 21: Sample CDNTN outputs for road word model

4.4.3  Construction  of  Subword  Duration  Templates 

Once the CDNTN subword models were created, temporally aligned feature vectors corresponding to each keyword utterance from the training data base were applied to each CDNTN phoneme model to obtain a subword state output vector corresponding to state emissions as a function of time for the keyword. Figure 21 shows a sample sequence of the CDNTN state models for an utterance of the keyword road by a speaker in the Waterloo section. Similar outputs were obtained for each keyword utterance for all keywords in the Waterloo section. Figure 21 illustrates how a single speaker dependent duration template can be constructed from a single utterance of a keyword. To obtain a general speaker independent duration template, it is necessary to derive a template for each speaker in the training data base, and average the templates to form a general durational model.

A time aligned clustering method [25] was used to combine the duration templates generated from each keyword utterance in the Waterloo section. The template clustering algorithm creates a template representing an average of individual templates time warped to the template with the minimum distance to every other template. The algorithm can be briefly described by constructing a two-dimensional grid, where each grid location (i, j) stores a distance d(x_i, x_j) between the subword duration patterns x_i and x_j of the individual speakers. The distance metric d(x_i, x_j) is defined as the distance obtained by dynamic time warping vector x_i with x_j, using a Euclidean distance metric. Once the distance grid is computed, the duration vector which is found to have the smallest average distance to every other duration template is defined as the center template. Once the center template is found, each remaining duration vector is again time warped to align the sequences, and an average template is computed by averaging the time aligned duration outputs. Figure 22 shows the average template for the keyword road and Figure 23 shows another template for the keyword secondary. As can be seen in Figure 22 and Figure 23, the CDNTN provides a relatively good model for the subwords across all the Waterloo speakers. A poor subword model will result in a blurred average template. The duration template can be thought of as a matched filter for the subword outputs during a keyword. If a non keyword is presented to the system, the subword models will output low random probabilities which will in turn produce distortion values during the template match.

Figure 22: Averaged duration template for road state emissions
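The template clustering procedure described above (pairwise DTW distance grid, selection of the center template, time alignment and averaging) can be sketched as follows. This is an illustrative reading, not the implementation of [25]: the DTW step pattern, the inclusion of the center template itself in the average, and the function names dtw_path and average_template are all assumptions.

```python
import math

def dtw_path(a, b):
    """Return (accumulated distance, warping path) aligning sequence a to b."""
    I, J = len(a), len(b)
    INF = float("inf")
    acc = [[INF] * (J + 1) for _ in range(I + 1)]
    acc[0][0] = 0.0
    for i in range(1, I + 1):
        for j in range(1, J + 1):
            d = math.dist(a[i - 1], b[j - 1])  # Euclidean frame distance
            acc[i][j] = d + min(acc[i - 1][j], acc[i][j - 1], acc[i - 1][j - 1])
    # Backtrack from (I, J) to the origin to recover the aligned frame pairs.
    path, i, j = [], I, J
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        steps = [(acc[i - 1][j - 1], i - 1, j - 1),
                 (acc[i - 1][j], i - 1, j),
                 (acc[i][j - 1], i, j - 1)]
        _, i, j = min(steps)
    return acc[I][J], path[::-1]

def average_template(templates):
    """Average a set of duration templates after warping each to the center one."""
    n = len(templates)
    # Distance grid: entry (i, j) is the DTW distance d(x_i, x_j).
    grid = [[dtw_path(templates[i], templates[j])[0] for j in range(n)]
            for i in range(n)]
    # Center template: smallest average distance to every other template.
    center = min(range(n), key=lambda i: sum(grid[i]) / n)
    ref = templates[center]
    dim = len(ref[0])
    sums = [[0.0] * dim for _ in ref]
    counts = [0] * len(ref)
    for t in templates:
        _, path = dtw_path(t, ref)
        for ti, ri in path:  # frame ti of t is aligned with frame ri of the center
            for k in range(dim):
                sums[ri][k] += t[ti][k]
            counts[ri] += 1
    return [[s / c for s in row] for row, c in zip(sums, counts)]
```

The averaged template has the same length as the center template, which matches the description of warping every other template onto it before averaging.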

To  show  the  separation  of  distortion  scores,  an  isolated  word  identification  test  was  performed  on  a 
limited  number  of  keywords  in  the  Stonehenge  section  of  the  Road  Rally  Corpus.  Figure  24  shows  the 
distortion  values  for  testing  isolated  utterances  of  retrace,  conway,  and  road  using  the  CDNTN  duration 
template  for  road.  A  clear  distinction  can  be  seen  in  distortion  levels  obtained  for  road  as  compared  to 
conway  and  retrace. 

4.4.4  CDNTN  Word  Spotting  System  Description 

To construct a word spotting system capable of real time performance, the DTW template matching scheme was performed in parallel for each word template by fixing a test template length based on the average template duration of each keyword. A sequence of output scores is generated by sliding this template along a stream of state emission probabilities generated from a continuous stream of speech data parameterized into 27 dimensional feature vectors and applied to the CDNTN model state generators. Figure 25 shows a diagram of the word spotting system for a single keyword. This new word spotting system differs from the HMM based system described earlier in that no background filler model is needed and putative keyword locations can be found independently without a network structure.

Construction of independent word spotting systems for each keyword allows the system to scale independently of the number of keywords, and also allows a simple parallelization of the system on a multi-processor network. A multi-word system can be made by running each independent keyword spotting system in parallel and combining the scores by a method which chooses the lowest score, as shown in Figure 26.
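The sliding template match and the lowest-score combination across keywords can be sketched as below. The sketch assumes a fixed window equal to the template length, a simple three-step DTW recursion, and a combination rule that keeps the single lowest score over all keywords and positions; these details, and the names sliding_scores and best_keyword, are illustrative assumptions rather than the system's actual design.

```python
import math

def dtw_distance(test, ref):
    """Accumulated DTW distance between two sequences of state-output vectors."""
    I, J = len(test), len(ref)
    INF = float("inf")
    acc = [[INF] * (J + 1) for _ in range(I + 1)]
    acc[0][0] = 0.0
    for i in range(1, I + 1):
        for j in range(1, J + 1):
            d = math.dist(test[i - 1], ref[j - 1])
            acc[i][j] = d + min(acc[i - 1][j], acc[i][j - 1], acc[i - 1][j - 1])
    return acc[I][J]

def sliding_scores(stream, template, step=1):
    """Score the template against each fixed-length window of the stream."""
    width = len(template)
    return [(start, dtw_distance(stream[start:start + width], template))
            for start in range(0, len(stream) - width + 1, step)]

def best_keyword(stream, templates, step=1):
    """Run each keyword spotter independently and keep the lowest score."""
    best = None
    for word, template in templates.items():
        for start, score in sliding_scores(stream, template, step):
            if best is None or score < best[2]:
                best = (word, start, score)
    return best  # (keyword, window start, distortion)
```

Because each keyword is scored independently, the loop over templates could be distributed across processors without changing the result.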

Each output word score in Figure 25 is compared to a local keyword threshold to determine a putative keyword occurrence. A distance threshold can be extracted from a histogram based on the scores of the keyword spotting system on a cross-validation data set for both correct hits and false alarms. The threshold location can then be found by choosing a value between the two distributions. An alternate method can use the results directly from the ROC curves for each keyword to define the recognition level at a particular false alarm rate.
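One way to choose a value between the two score distributions is sketched below. The selection criterion (minimizing the number of missed keywords plus accepted false alarms on the cross-validation scores) is an assumption made for illustration; the text does not state which criterion was used. The function name choose_threshold is hypothetical.

```python
def choose_threshold(hit_scores, fa_scores):
    """Pick a distance threshold from cross-validation scores.

    hit_scores: distortion scores of correct keyword detections
    fa_scores:  distortion scores of false alarms
    A putative hit is accepted when its score is <= the threshold.
    """
    candidates = sorted(set(hit_scores) | set(fa_scores))
    best_t, best_err = None, None
    for t in candidates:
        misses = sum(1 for s in hit_scores if s > t)        # true keywords rejected
        false_alarms = sum(1 for s in fa_scores if s <= t)  # non-keywords accepted
        err = misses + false_alarms
        if best_err is None or err < best_err:
            best_t, best_err = t, err
    return best_t
```

When the two distributions do not overlap, this returns the largest correct-hit score, which lies at the edge of the gap between them.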

To obtain ROC scores a simple dynamic threshold was used. The dynamic running average of the output score was used as a threshold to output putative hit locations for construction of the ROC table. The step size of the sliding template was set to 10% of the duration template time for the entire keyword utterance. The short sliding window step size results in multiple low adjacent scores as the template passes through a keyword. Once the list of putative hits was obtained, the overlapping putative hits were merged to form a single hit. To allow calculation of a figure of merit, the sequence of putative hits obtained from each keyword was sorted into a list based on increasing distortion level. A figure of merit for recognition performance was used to evaluate the system at false alarm rates of 0 to 10 false alarms/keyword/hour in the same fashion described in. Since the CDNTN word spotting system uses a dynamic threshold, no attempt is made to limit the number of false alarms with low keyword output scores.

Figure 25: CDNTN word recognition system for a single word

Figure 26: Multi word CDNTN word spotting system
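The post-processing of the score stream (dynamic thresholding, merging of overlapping hits, and ranking by distortion) can be sketched as follows. The exact form of the running average and the merge rule are not given in the text, so the cumulative mean and the keep-the-lowest-score merge used here are assumptions, as are the function names.

```python
def putative_hits(scores, width):
    """Flag a hit wherever the distortion drops below its running mean.

    scores: distortion at each template position
    width:  template length, used to give each hit an extent
    """
    hits, total = [], 0.0
    for n, s in enumerate(scores):
        total += s
        running_mean = total / (n + 1)  # assumed form of the dynamic threshold
        if s < running_mean:
            hits.append((n, n + width, s))  # (start, end, distortion)
    return hits

def merge_overlapping(hits):
    """Merge overlapping hits into one, keeping the lowest distortion."""
    merged = []
    for start, end, score in sorted(hits):
        if merged and start <= merged[-1][1]:
            p_start, p_end, p_score = merged[-1]
            merged[-1] = (p_start, max(p_end, end), min(p_score, score))
        else:
            merged.append((start, end, score))
    return merged

def ranked(hits):
    """Sort putative hits by increasing distortion level."""
    return sorted(hits, key=lambda h: h[2])
```

For example, two adjacent low scores produced as the template passes through one keyword collapse into a single putative hit carrying the lower of the two distortions.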

Figure 27 shows a sample output of the duration distortion for the keyword “springfield” when the utterance “take the primary interstate west into Springfield” is spoken. The figure shows the distortion of the duration template algorithm as a function of feature sample. The keyword occurs in the utterance at the negative peak at approximately sample 325, and can be easily extracted by simply thresholding the output distortion.

5 Results

5.1  Results  for  Testing  on  10  Male  Speakers 

The training algorithm for the NTN uses a sequential backpropagation update rule for training the perceptrons at each node. This means that the perceptron weights are updated after each feature vector. To prevent ordering effects of the data from producing a bias in the gradient descent, the training vectors are initially randomly ordered. Each new randomization allows a different local minimum to be found in the search space. To allow multiple minima to be evaluated, multiple sets of weights are trained for each subword state model and tested against a cross-validation set to select the best models. The NTN growing algorithm is also implemented with a mean splitting initialization of the perceptron to allow fast convergence. A minimum number of epochs for the perceptron training was set at 200 to ensure a minimum search time. A momentum term of 0.4 was added to the backpropagation algorithm to speed convergence, with a step size of 0.5. Convergence was assumed when the difference of the mean square error measured every 10 epochs was less than 0.0001. During testing, the prior probabilities of each class were normalized to 0.5 to remove the bias created from the unequal amounts of anti-class data and in-class data, by the method described in the appendix.
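The training schedule described above can be sketched for a single perceptron. The step size (0.5), momentum (0.4), minimum of 200 epochs, and the convergence check every 10 epochs with a 0.0001 tolerance are taken from the text; the sigmoid activation, the squared-error gradient, the zero weight initialization (the mean splitting initialization is not reproduced), the max_epochs safeguard, and the function name are assumptions for illustration.

```python
import math
import random

def train_perceptron(data, step=0.5, momentum=0.4, min_epochs=200,
                     check_every=10, tol=1e-4, max_epochs=5000, seed=0):
    """Sequential (per-vector) training of one sigmoid perceptron.

    data: list of (feature_vector, target) pairs with target 0 or 1.
    Returns the weights (bias last) and the number of epochs run.
    """
    rng = random.Random(seed)
    order = list(range(len(data)))
    rng.shuffle(order)               # initial random ordering of training vectors
    dim = len(data[0][0])
    w = [0.0] * (dim + 1)            # last entry is the bias weight
    prev_delta = [0.0] * (dim + 1)
    last_mse = None
    epoch = 0
    for epoch in range(1, max_epochs + 1):
        sq_err = 0.0
        for idx in order:
            x, target = data[idx]
            xb = list(x) + [1.0]
            out = 1.0 / (1.0 + math.exp(-sum(wi * xi for wi, xi in zip(w, xb))))
            err = target - out
            sq_err += err * err
            grad_scale = err * out * (1.0 - out)
            for k in range(dim + 1):
                # Weight update after every feature vector, with momentum.
                delta = step * grad_scale * xb[k] + momentum * prev_delta[k]
                w[k] += delta
                prev_delta[k] = delta
        mse = sq_err / len(data)
        if epoch % check_every == 0:
            converged = last_mse is not None and abs(last_mse - mse) < tol
            if converged and epoch >= min_epochs:
                break
            last_mse = mse
    return w, epoch
```

Calling the function with different seed values changes the initial ordering, which is one simple way to obtain the multiple candidate weight sets that are then compared on a cross-validation set.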

Table 1 outlines the performance of the CDNTN model word spotting system for testing against 10 male speakers in the Stonehenge section of the Road Rally Speech Corpus. The features used were MFCC coefficients with added energy and acceleration terms, with the mean removed. Each subword unit in this system was trained using a 5% percentage based forward pruning method with a maximum tree level set to seven, as described in the appendix. The maximum number of mixtures created for both in-class and anti-class vectors in each leaf was limited to six. An iterative k-means clustering method was used, starting with a single mixture and incrementally increasing the number dynamically until fewer than n vectors were assigned to each cluster. The limiting constant n was chosen as the dimension of the feature vector, in this case 27. The training data for this system used only the keyword utterances from each of the 56 speakers in the Waterloo section of the Road Rally Speech Corpus. Each recited paragraph from the Waterloo section provided 99 keyword tokens for training per speaker. Discriminative training data for the subword units was selected as the feature vectors assigned to the remaining subwords within the keyword. Subsequently, each keyword model can be trained using only data from the utterances of that keyword from the training data base. The total number of keyword tokens used for training this system was 5,544 for all 20 keywords. This averages to 277.2 tokens/keyword and approximately 5 tokens/keyword/speaker.
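The token counts quoted above follow directly from the corpus figures. The short check below uses only numbers stated in the text; the variable names are illustrative.

```python
speakers = 56               # speakers in the Waterloo section
tokens_per_paragraph = 99   # keyword tokens in each recited paragraph
keywords = 20

total_tokens = speakers * tokens_per_paragraph                    # 5544
tokens_per_keyword = total_tokens / keywords                      # 277.2
tokens_per_keyword_per_speaker = tokens_per_paragraph / keywords  # 4.95, about 5

print(total_tokens, tokens_per_keyword, tokens_per_keyword_per_speaker)
```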

Using the DTW scoring method with a fixed sliding window allows an actual distance metric to be defined for putative hits; thus a threshold can be set for each keyword according to Table 1. As can be seen in Table 1, the performance varies between keywords. The total time of actual speech for the male test is approximately 44 minutes. Given this limited time, the error rates given in Table 1 are interpolated from the first 5 false alarms encountered in the ranked keyword lists.
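The relation between the test duration, the false alarm rate, and the ranked keyword lists can be made concrete with a short sketch. It shows how many false alarms per keyword correspond to 10 FA/keyword/hour over 44 minutes of speech, and how a hit rate can be read from a ranked list just before a given false alarm. The interpolation scheme actually used for Table 1 is not reproduced here, and the function name and sample list are hypothetical.

```python
def hit_rate_before_false_alarm(ranked_is_hit, n_actual, k):
    """Fraction of actual keywords found before the k-th false alarm.

    ranked_is_hit: putative hits in order of increasing distortion;
                   True marks a correct detection, False a false alarm.
    n_actual:      number of true keyword occurrences in the test data.
    """
    found = false_alarms = 0
    for is_hit in ranked_is_hit:
        if is_hit:
            found += 1
        else:
            false_alarms += 1
            if false_alarms == k:
                break
    return found / n_actual

test_hours = 44 / 60                            # about 0.73 hours of male test speech
false_alarms_at_10_per_hour = 10 * test_hours   # about 7.3 false alarms per keyword

# Illustrative ranked list for one keyword with four actual occurrences.
sample = [True, True, False, True, False, True]
print(hit_rate_before_false_alarm(sample, 4, 1))
print(hit_rate_before_false_alarm(sample, 4, 2))
```

With only about 7 false alarms per keyword available at the highest rate of interest, each additional false alarm in the ranked list moves the operating point by more than 1 FA/keyword/hour, which is why interpolation is needed.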


Figure 29 and Figure 30 show the histograms of the DTW distance scores for putative hits for retrace and secondary. As can be seen in the figures, the false alarm distortion measures are well above the majority of actual keyword scores. In these figures the keyword distortions were obtained directly from the DTW distance found between the test state duration template and the reference state duration template. For ROC curve estimation, the threshold was set dynamically as the instantaneous mean of the keyword score. This allowed a coarse threshold for simplified scoring. No attempt was made to limit the number of false alarms associated with high distortion levels. This can be seen in Figure 29 and Figure 30 as the high number of false alarms on the right side of the histograms.

Figure 32 shows the overall performance of the system for all 20 keywords up to an error rate of 10 FAs/Keyword/Hour for the male, female and cross-sex tests. As can be seen in the figure, the male speakers scored better than the female speakers. This is similar to the results found using the HMM word spotting system. Table 2 outlines the results for male, female and cross-sex tests. Since dynamic thresholding was used to obtain data for FOM scoring, no attempt was made to minimize false alarms with low keyword scores. This is reflected in the total number of false alarms in the table. The high number of hits in Table 2 shows that this system can benefit greatly from a post processing method to further refine the keyword scores. A multi-level classifier system such as described by [11] [12] [28] has great potential for re-scoring false putative hits. ROC curves for each of the 20 keywords are shown in Figure 33 for the male test. This figure shows that a large majority of the keywords performed well, while a subset of keywords brought down the average score. Figure 34 shows the average performance of the system for 16 multi-syllable keywords. Since no explicit background model is used, the shorter, simple keywords perform much worse than the longer keywords. This is a direct result of the fact that the longer keywords have more subword unit models, which are more difficult to fit to random speech by the template matching duration model.

Figure 28: Word Spotting Performance for CDNTN trained phonetic subwords for each keyword, male only test

Figure 29: Histogram of keyword hits (solid) and false alarms (dotted) as a function of word score for retrace

Figure 30: Histogram of keyword hits (solid) and false alarms (dotted) as a function of word score for secondary

Male Test (16 Keywords)   285   8639   331   51.35%   53.9%

Figure 31: CDNTN Keyword Spotting Performance

Figure 32: Overall performance for 20 keywords vs. FA/Keyword/Hour

5.1.1  Performance  as  a  Function  of  Training  Parameters 

A test was performed measuring the performance of the CDNTN word spotting system as a function of the maximum number of mixtures per class in the leaves. Table 3 gives the performance for the baseline system using the 5% forward pruning level described in the previous section, with various maximum numbers of mixtures. The test was made on a cross validation set for 12 male speakers using conversational data. In general, both HMM and CDNTN systems performed worse on this test. Table 3 shows that increasing the maximum number of mixtures available increases performance.

ID  Training and Test Conditions   #Hits  #FAs   #Actual  FOM     Hit Rate for 6 FAs/hour
1   Max 2 mixtures/class/leaf      387    47716  482      24.25%  26.8%
2   Max 6 mixtures/class/leaf      367    42123  482      27.02%  31.6%
3   Max 12 mixtures/class/leaf     387    43903  482      28.53%  33.3%

Figure 35: CDNTN keyword spotting performance as a function of Gaussian mixtures on male cross validation set

Table 4 gives the performance of the system as a function of the forward pruning threshold. The systems in the table were trained as described in the previous section, using a maximum of six mixtures/class/leaf. Table 4 shows that peak performance on the male cross validation set was obtained at 5%.

ID  Training and Test Conditions   #Hits  #FAs   #Actual  FOM     Hit Rate for 6 FAs/hour
2   Stop @ < 5% of training set                  482      27.02%  31.6%
3   Stop @ < 2.5% of training set  359    39965  482      26.46%  30.5%

Figure 36: CDNTN keyword spotting performance as a function of forward pruning percentage level on male cross validation set

5.1.2  CDNTN  Word  Spotting  Performance  with  Reduced  Data 

One of the primary advantages of using a discriminative classifier for modeling the state occupations within an HMM model is the use of the additional data provided by the anti-class feature vectors. Discriminative training maximizes the use of costly training data. The CDNTN model provides an effective means for blending the attributes of both the continuous mixture model and the discriminative neural network. The grouping action of the NTN allows an efficient parametric model to be made for the vectors in each region of the feature space. The separating hyperplanes defined by the perceptrons partition the feature space into high and low confidence regions. A minimal number of exemplars are necessary to define the regions defined by the NTN leaves.

To measure the performance of the CDNTN system as a function of training tokens, a number of systems were trained varying the number of speakers in the training set. Each system was trained using the identical training parameters of 5% pruning, seven maximum NTN levels, and six maximum mixtures/class/leaf. Table 5 gives the results for each system.

The recited paragraph used in the Waterloo section contains 99 keyword tokens, which amounts to approximately 5 tokens/keyword/speaker. Since no background model is assumed, no extra tokens are needed to train the system. This considerably reduces the cost involved in training the system in terms of providing marked speech files. For many applications, such as monitoring keywords from a non-cooperative subject, large amounts of speech data may be impossible to obtain. The new word spotting system described maximizes the use of available tokens by the CDNTN state models to obtain superior performance to comparable HMM systems. Figure 38 shows the system performance of the CDNTN word spotter compared to two HMM systems. The top HMM system, using all the available training data from the Waterloo passage, contains a total of 321 tokens for training both the keywords and the background model, not including the silence model. This translates to approximately 16 tokens/keyword/speaker. The CDNTN system requires only the keyword data, which translates to approximately 5 tokens/keyword/speaker for training, with no extra data for modeling background silence. The crossover point between the best HMM system and the CDNTN system occurs between 6 and 7 speakers. Figure 38 also shows the performance of the HMM system trained using only pooled keyword data for both keyword triphone models and the background model, except for the silence model, which uses background silence between keywords. A silence model is required for an HMM system to achieve non-trivial performance. The comparatively trained HMM system, shown by the dot-dashed line in Figure 39, performed worse than the CDNTN system when fewer than 25 speakers were used for training. Figure 39 shows the performance for the different systems as a function of the average number of training tokens used per keyword. The token counts do not include the tokens used to generate the silence models for the HMM systems.

Figure 37: CDNTN word spotting performance for varying amounts of training data for male test

Figure 38: Comparison of CDNTN to HMM word spotting system with limited training speakers on male test

Figure 39: Comparison of CDNTN to HMM word spotting system as a function of training tokens on male test